NEW PROBLEMS FOR OLD SOLUTIONS*


HUBERT E. BROGDEN


PERSONNEL RESEARCH BRANCH 


THE ADJUTANT GENERAL’S OFFICE†


Some of the methodologies that have become standard tools in psycho- 
metrics suffer from neglect. They are taken too much for granted and are not 
given the attention that seems appropriate to the important role they play 
in research advances. I propose to make some suggestions which may, in 
a modest way, assist in alleviating this difficulty. In a very general way I 
would like to suggest that effort expended in examining a variety of restate- 
ments of a methodological problem may lead to new methodologies of real 
value. 

In a sense then, I am suggesting that we look for new problems in areas 
where old solutions are available, or possibly, for problem restatements 
where the application of an old solution may have become too automatic 
and too uncritical. 

Since a discussion of all of this in general terms would tend toward 
triteness, I plan to proceed in part through the use of examples with the 
thought in mind that a number of such examples (bearing upon similar 
problems) may be of some additional value in suggesting a generalized ap- 
proach to certain classes of problems. For simplicity, I will avoid problems 
associated with sampling error and limit the discussion throughout the paper 
to cases involving very large samples. 

An article by Guilford and Michael entitled “Approaches to Univocal
Factor Scores” illustrates the kind of problem restatement that I have in
mind. A solution to the problem of estimating factor scores, providing esti- 
mates that are best in the least square sense, has been available for some time. 
However, it has been frequently observed that the least squares estimates of 
orthogonal factors tend to intercorrelate substantially. Having in mind, 
possibly, this apparent defect of scores estimated through the least squares 
method, Guilford and Michael suggest as an alternative approach scoring 
or weighting procedures designed to yield a univocal factor score—a score 
having variance in only one common factor, its remaining variance being
attributable to errors of measurement plus possible specific variance. This
restatement emphasizes the reduction of bias or contamination and relegates
accuracy of measurement to a secondary role.

*Presidential Address to the Psychometric Society, September 4, 1957.

†The opinions expressed are those of the author and do not reflect official Department
of the Army policy.

While I do not wish to consider this problem in greater detail or attempt 
to describe the solution, I do have a further point regarding the justification 
of the problem restatement, which is pertinent here and will be pertinent in 
the examples to be discussed later. In considering possible methods for 
estimating factor scores, an early question might well be—how will the 
factor scores be used, or what kinds of conclusions will be drawn and what 
kinds of decisions will be made as a result of their use? It seems desirable 
that the statement of the problem should be phrased so as to maximize the 
likelihood that such conclusions or decisions will be correct. This may often 
lead to a number of alternate problem statements since the same problem 
statement may not permit correct conclusions in research studies with various 
objectives. 

Let us consider, for a moment, the factor score problem in relation to 
these questions. I realize that many investigators seeking estimates of factor 
scores may have purposes in mind which are consistent with the use of the 
least squares estimates. They may wish, for example, to report estimates 
of the factor scores to the individual subjects taking a set of tests. To illus- 
trate the role of anticipated conclusions or decisions in the statement of the 
problem, let us consider a class of studies in which the investigator seeks 
to extend his knowledge of a set of factors by relating the factor score esti- 
mates to a set of new variables. This will permit us to relate the conclusions 
resulting from such a study to the statement of the problem. 

An investigator with such a purpose in mind will draw no conclusions 
from the individual scores and has no proper direct interest in such scores. 
He will draw conclusions only after correlating the factor scores with the 
additional set of variables under investigation. His conclusions will relate 
specifically to the correlations between the factor estimates and this additional 
set of variables. A problem restatement such as that of Guilford and Michael 
could well have been tied directly to this intended use of the factor estimates. 
If the factor estimates are collinear with the factor, the conclusions can be 
proper and valid; if they are not collinear, the correlations will be biased 
(as estimates of factor loadings) and the conclusions will be directly and 
adversely affected. 

A consideration of the problem of estimating a composite criterion 
score from unreliable components will provide a further opportunity to 
illustrate the kind of problem restatement I have in mind. Suppose that a 
set of criterion components are available which are deficient only in that 
error of measurement is present. The problem, in general, is the selection of 
weights for the components that will yield the best estimate of the true 
composite. 

A least squares solution to this problem is possible and has often been
proposed as best. If, however, we reconsider the statement of the problem
and examine particularly the nature of the conclusions to be drawn when the
estimate of the criterion composite is used for validation research, it can be
shown that the least squares solution is irrelevant and gives weights negatively
related to those provided by the proper solution.

Assuming that the criterion composite is to be used for validation studies, 
the validity coefficients or the partial regression weights of the predictors 
are of central importance, since these are basic to the conclusions deriving 
from the validation study. The following restatement of the problem is 
suggested after considering the intended use of the criterion: what set of 
weights for the fallible criterion components will insure that, in a later vali- 
dation study, the validity coefficients (or the partial regression weights) 
obtained against the estimated composite will be the same as those that would 
be obtained if the true composite were available? 

R. H. Gaylord and I have considered this problem and a solution has been 
achieved. An interesting aspect of the solution is the relationship between 
the magnitude of the weights and the reliability of the components—the less 
reliable the component the larger the weight, when the components are in 
standard score form. To understand this point note that—when the compo- 
nents have unit variance—if more error variance is present less true score 
variance will remain. Hence, a heavier weight is needed if the true score 
variance of an unreliable component is to be proportionately represented in 
the over-all criterion composite. 

With a least squares solution the opposite is true: the greater the error 
in the criterion component the lower the least squares weight. The least 
squares methods yield a composite that has maximum correlation with the 
true composite but which is biased for the purpose of validation research. 
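
The contrast between the two sets of weights can be sketched numerically. The sketch below rests on assumptions that are not in the address: each standardized component is a true score plus error, error is uncorrelated with everything else, the components are mutually uncorrelated, and the true composite is the unit-weighted sum of true scores rescaled to unit variance. The reliabilities and predictor covariances are hypothetical. Under these assumptions the bias-free weight is 1/sqrt(reliability) and the least squares weight is sqrt(reliability); this is an illustration of the direction of the effect, not necessarily the Brogden-Gaylord solution itself.

```python
import math

# Hypothetical reliabilities of two standardized, mutually uncorrelated components.
reliabilities = [0.9, 0.4]

# Hypothetical covariances of two predictors with each observed component.
# Because error is assumed uncorrelated with the predictors, these are also
# the covariances with the components' true scores.
predictor_covariances = {
    "predictor_A": [0.45, 0.10],
    "predictor_B": [0.10, 0.30],
}

def covariance_with_true_composite(covs, rels):
    """Covariance with the sum of true scores rescaled to unit variance."""
    return sum(c / math.sqrt(r) for c, r in zip(covs, rels))

def covariance_with_weighted_composite(covs, weights):
    """Covariance with the weighted sum of the fallible components."""
    return sum(w * c for w, c in zip(weights, covs))

# Weights that reproduce the true-composite covariances: larger when less reliable.
bias_free_weights = [1.0 / math.sqrt(r) for r in reliabilities]
# Least squares weights under the stated assumptions: smaller when less reliable.
least_squares_weights = [math.sqrt(r) for r in reliabilities]

for name, covs in predictor_covariances.items():
    true_cov = covariance_with_true_composite(covs, reliabilities)
    bias_free = covariance_with_weighted_composite(covs, bias_free_weights)
    least_sq = covariance_with_weighted_composite(covs, least_squares_weights)
    print(name, round(true_cov, 4), round(bias_free, 4), round(least_sq, 4))
```

With the bias-free weights every predictor's covariance with the estimated composite equals its covariance with the true composite, so the relative sizes of the validity coefficients are preserved. With the least squares weights the relative standing of the two predictors changes.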

The approaches to this problem and to the problem of factor score 
estimation have much in common. In each instance the least squares solution 
was, perhaps, the more obvious one. The problem restatement could in each 
case be derived from an examination of the way in which the solution would 
affect the conclusions of research studies in which it was applied. There is a 
further point of similarity. The criterion estimation problem might have 
been stated: what estimated criterion composite will be collinear with the 
true composite? 

A restatement of the problem of item difficulty distribution may throw 
still further light on the general point I have in mind. A number of investi- 
gators have in various ways studied the relation of tests to underlying ability 
and have shown for various conditions the characteristics of items and item 
difficulty distributions that will yield the most efficient measurement of 
underlying ability. While efficiency of measurement has been defined in a 
number of ways, all of the definitions resemble the least squares definition 
in that all are concerned with maximizing some index of the degree of relationship
between the test and underlying ability. Ferguson and others have
discussed the problem associated with difficulty factors and have stressed the
way in which correlations among tests may be distorted as a function of
similarity or dissimilarity in the difficulty distribution of the items.

Since the phenomenon of difficulty bias in correlations is basic to the 
problem restatement I wish to propose, further explanation of this bias seems 
desirable at this point. It is well known that if the p-values of two dichotomous
variables are dissimilar, the phi coefficient will tend to be low and this
will hold although the tetrachoric correlations between all pairs of items 
are equal. Now, with test scores the same phenomenon is evident, particu- 
larly if the tests are homogeneous in difficulty. Two tests, each homogeneous 
in difficulty, will correlate more highly if the difficulty level of the two tests 
is approximately the same than if the difficulty levels are divergent. Many 
test types proposed as efficient measures of underlying ability in the least 
squares sense have items homogeneous in difficulty. Hence, the correlations 
among such tests and between such tests and other variables are also subject 
to difficulty bias—probably more so than with tests having a greater spread 
of item difficulty. 
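
The difficulty bias in the phi coefficient can be checked numerically. The sketch below assumes that two dichotomous items arise by cutting a standard bivariate normal surface, so the latent correlation `rho` plays the part of the tetrachoric correlation; the function names and the values of `rho` and the p-values are illustrative choices, and the joint probability is obtained by a simple trapezoid rule rather than any method discussed in the address.

```python
import math
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution

def joint_pass_probability(p1, p2, rho, steps=4000, upper=8.0):
    """P(both items passed) when item i is passed with probability p_i and
    the latent bivariate normal correlation is rho (trapezoid integration)."""
    a = STD.inv_cdf(1.0 - p1)  # cut point on the first latent variable
    b = STD.inv_cdf(1.0 - p2)  # cut point on the second latent variable
    scale = math.sqrt(1.0 - rho * rho)
    h = (upper - a) / steps
    total = 0.0
    for k in range(steps + 1):
        x = a + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        # Given X = x, the second variable is normal with mean rho*x, sd scale.
        total += weight * STD.pdf(x) * (1.0 - STD.cdf((b - rho * x) / scale))
    return total * h

def phi_coefficient(p1, p2, rho):
    """Phi between two items with p-values p1, p2 and latent correlation rho."""
    p11 = joint_pass_probability(p1, p2, rho)
    return (p11 - p1 * p2) / math.sqrt(p1 * (1 - p1) * p2 * (1 - p2))

same_difficulty = phi_coefficient(0.5, 0.5, 0.6)
different_difficulty = phi_coefficient(0.5, 0.9, 0.6)
print(round(same_difficulty, 4), round(different_difficulty, 4))
```

The latent correlation is the same in both calls, yet phi is lower for the pair with divergent p-values. For two median splits the result can be checked against the closed form (2/pi) arcsin(rho).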

I do not mean to disparage tests designed as efficient measures of under- 
lying ability or to imply that the statement of the problem leading to this 
solution is defective. I merely wish to suggest that an alternate problem 
statement is possible and, for certain purposes, may be more desirable. 

Where a contribution to subject matter knowledge is the object of an 
investigation, it seems proper that elimination of bias is all-important and 
reduction of error of measurement is of secondary importance, particularly
since, with knowledge of error of measurement, methods of estimation are 
often available and appropriate which will make allowance for the attenuat- 
ing effect of error. 

Thus the problem, as restated in a general manner, might be to determine 
the optimum difficulty distribution for a test to be used for investigating 
the relation between the ability it measures and other variables. What we want 
is a test which will yield the same pattern of correlations with other variables 
as would underlying ability, regardless of the nature of such other variables. 
If the correlations between the test and other variables are the same, after 
correcting for the effect of error, as the correlations between underlying 
ability and such variables, the conclusions or decisions reached will be the 
same. 

I cannot offer, with proof, an exact statement of the over-all problem 
and an accompanying solution. While the foregoing discussion is possibly 
sufficient in view of the theme and scope of this paper, a further problem 
statement and a possible solution may be of some interest. The justification 
of these further developments must remain largely intuitive. 

A brief indication of the major assumptions and limiting conditions
may be helpful before continuing with the new problem statement. Obviously,
in this brief discussion it is not feasible to state these in full. Underlying
ability is defined as a perfect normally distributed measure of the ability 
common to the dichotomous items. The problem is limited to tests in which 
all items have the same biserial correlation with underlying ability. Ability 
and error are assumed to be the only determiners of the item responses. 

If I rephrase the problem statement and ask: what item difficulty 
distribution will yield a test score such that the bivariate frequency surface 
of the test score and underlying ability is normal, the statement then appears 
to be more precise and seems to be a more feasible starting point in a mathe- 
matical development leading to a demonstrated solution. 

This more precise restatement is, I believe, logically equivalent to the 
prior and more general restatement of the problem. From this second restate- 
ment it follows that the test is a simple linear function of underlying ability 
and error, and that the test is described thus through its entire range—given 
only the product moment correlation between the test and ability. It also 
follows, then, that the correlation between ability and any other variable 
can be estimated, given the correlation between this variable and the test 
and, of course, the correlation between ability and the test. A linear model
equivalent to that used in factor analysis is applicable: the correlation
between the test and the other variable is the product of the correlation
between the test and ability and the correlation between ability and the other
variable, so the estimate is the ratio of the first of the above two
correlations to the second. It is emphasized that this model
is believed to hold regardless of difficulty biases that may be present in the
other variable. Hence, when the bivariate surface of the test and ability is normal, it
seems reasonable that the test can be used in place of ability and, with correction
for error of measurement, the conclusions reached through the use of the
test are the same as those that would be reached had underlying ability been
available.
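
Under the linear model described above, the correlation between the test and another variable equals the correlation between the test and ability multiplied by the correlation between ability and that variable. Solving for the last term gives the estimate. The short sketch below only restates that arithmetic; the function name and the numerical values are hypothetical.

```python
def estimate_ability_correlation(r_test_other, r_test_ability):
    """Estimate r(ability, other) from r(test, other) and r(test, ability),
    assuming r(test, other) = r(test, ability) * r(ability, other)."""
    if not 0.0 < abs(r_test_ability) <= 1.0:
        raise ValueError("r_test_ability must be nonzero and at most 1 in size")
    return r_test_other / r_test_ability

# Hypothetical values: the test correlates .80 with ability and .40 with
# some other variable.
estimated = estimate_ability_correlation(0.40, 0.80)
print(estimated)
```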

The item difficulty distribution that I have in mind as a possible solution 
to this problem is perfectly rectilinear, with the item difficulty index expressed 
as baseline values of a normal curve. To achieve this distribution, items
would be selected with difficulties of 0, +.1, -.1, +.2, -.2, +.3, -.3, etc.
In theory, such a distribution would place items at equal difficulty intervals 
ranging from plus infinity to minus infinity. 

Now, the general point of this paper had to do with the value of examin- 
ing alternate statements of the problem, and I believe these three examples 
illustrate this general point. I have been developing as a second general point 
the need for examining the conclusions or decisions to be made when the 
methodologies under consideration are to be applied, and the desirability 
of reasoning from such conclusions to a justification of the methodology. 
This point has been stressed sufficiently and discussed in relation to each of 
the three examples. 

The distinction between the kind of problem statement that leads to 
a least squares solution (or something akin to such a solution) and the alternate
problem statements that we have considered deserves some extra
comment.

We have noted that if, in each case, the scores were to be used for practi- 
cal estimation of an individual’s standing, the least squares solution would 
likely be satisfactory. 

If interest did not center on direct use of the scores and if the scores 
were to be used as a means of arriving at further conclusions through addi- 
tional research, alternate problem statements pointing toward reduction of 
bias have been suggested as more pertinent and more acceptable. 

The added suggestion I wish to make at this point deviates from the 
central theme of the paper and relates to the above similarities in the three 
examples. I am suggesting merely that the above-noted distinction may 
extend beyond these three examples. In additional problems where the least
squares solution has been accepted as best, a close examination of the problem 
in relation to the decisions to be made when the resulting method is used may 
again suggest an alternate problem statement. In other words, the particular 
distinctions between the least squares problem statement and the reduction 
of bias problem statement may have more general value. 

The latter portion of this paper will be directed toward possible sources 
of confusion between different classes of problems or problem statements 
rather than toward restatements of problems as such. 

A very general distinction in the methodologies widely used in psychol- 
ogy is relevant in several ways to the present discussion, although little of 
what I have to say is really new or different. I am speaking of the distinction 
between a correlational approach and approaches primarily based on con- 
trolled experimentation. I would like to discuss these two very general 
approaches in relation to classes of practical decisions properly stemming 
from empirical evidence. I am choosing cases involving practical decisions 
so that certain points can be made most clearly, not because the points I 
wish to make are necessarily limited to cases involving practical decisions. 

If the practical decision is a choice between administering or not admin- 
istering a given treatment, it is well recognized that a controlled experiment 
is properly used to demonstrate the effect of the treatment. Knowledge of the 
effect of the treatment then becomes a major factor in deciding whether 
or not the treatment will be used. I have no real comment here. I believe 
that few would hold that a correlational study—without experimental con- 
trols—is proper backing for such a decision. 

While the class of decision for which correlational evidence is appro- 
priate is fairly well recognized in practice, it is somewhat more difficult to 
find a clear statement enjoying widespread agreement in discussion of scientific 
method. I should like to suggest at least one type of practical decision where 
a correlational design is clearly pertinent. I mean, specifically, a decision to 
use or not to use a test or measure for the identification and hiring of personnel.
With regard to this type of decision, I would like to make two points:

(1) a correlation design will show what criterion performance can be
expected from persons with a given test score, thus giving information
basic to the decision in question,

and

(2) a controlled experiment (showing the relationship between a test and
an appropriate criterion with other variables held constant) may
suggest but does not demonstrate the value of this independent variable
for selection purposes.

My point regarding the correlation design should be clear and acceptable 
without elaboration. The second point calls for further discussion. 

A true controlled experiment is in some ways quite meaningless when 
a test of an ability or personality trait is the independent variable. A test 
score cannot be meaningfully manipulated—the individual differences in 
the test score must be taken as they come or created by selection of cases. 
Moreover, experimental controls are difficult to accomplish. Such controls 
must again be achieved by selection of cases. Most important, however, it 
is difficult to define the “other variables” that are to be held constant in 
a controlled experiment. Consider, for example, the consequences of holding 
constant an alternate form of the test used as the independent variable, or 
the consequences of holding constant a number of tests so chosen that the 
common-factor variance of the independent variables will be reduced to zero. 

If we disregard the problems I have just raised and assume that a
test has been found to predict a criterion, and that all other variables were 
held constant through selection of cases, an additional difficulty still arises. 
With selection of cases the sample in which this relationship is demonstrated 
can no longer resemble the sample in which the application must take place, 
and the relationship discovered in the validation sample cannot be applied 
with confidence. To further clarify this point, consider the kind of selection 
procedure that is supported by the evidence of a controlled experiment.
The two steps of the procedure are: (1) selection of applicants to duplicate 
the effect of the operations used in the validation sample to hold all other 
variables constant, and (2) within the remaining applicants, selection of 
those with high scores on the test under investigation. Needless to say, this 
two-step procedure is not appropriate to the practical problem. 

Although, as I had suggested earlier, the thoughts expressed with regard 
to these two designs are not new, I hope that the examination of the designs 
in relation to the decisions to which they are pertinent may have provided 
some new insight into the distinction between these problems. 

A second general distinction between classes of problems arises in con- 
nection with scaling. Consider, as one problem, the search for units of measure- 
ment that have the properties of a true scale. Many authors have struggled 
with this problem as it relates to the general methodology of science and
most agree to a number of desirable features of so-called true scales. Such 
scales are fairly common in the physical sciences. In psychology, they are 
sought after but rarely achieved. 

A second class of scaling problem is, I believe, distinctly different from 
the problems associated with true scales. If the scaling problem is pointed 
toward practical application and tied to decision theory, we are then seeking 
units such that the scale has the properties necessary to permit the decision 
in question. Thus, particularly in connection with criterion problems for 
personnel selection research, the notion of a common metric and the notion 
of equal units of scaling deviate from the philosophy behind true scales and
take on a very different meaning. It may well be that a dollar unit is a true
common metric and a true scale unit in all of the senses relevant to decisions 
properly made as a result of selection research. 

Yet such a scale, while highly relevant to this kind of decision, has no 
apparent relevance to the scaling problems mentioned earlier. If we con- 
struct a performance test to measure—job sample wise—proficiency in a 
particular job element, we should not seek a scale that discriminates equally 
well at all levels of difficulty and conforms to the desirable attributes of a 
true scale. We seek a count of behaviors that is representative of the actual 
behavior in this particular aspect of the job. Such a scale will discriminate 
only at the appropriate level of difficulty whether the behaviors required 
of the job incumbent are all very easy or all very difficult. In other words, 
we seek a count of behaviors, and we seek them at a difficulty level such that 
they can be properly evaluated as representing profits or losses to a decision 
maker—assuming that the purpose of the decision maker is to maximize 
profits. 

I suspect that this difference in the purposes of scaling, as seen in a
practical decision problem on the one hand and in the development of a
general body of scientific knowledge on the other, can be differentiated further.
I suspect, also, that many investigators have not distinguished between
these two types of scaling problems and that the scales developed may have 
been less adequate or less suitable as a result. 

In summary, let me point again to several of the major points of this 
paper. Let me repeat and emphasize my belief that, in developing a method- 
ology, we must closely examine the decisions to be made or the conclusions 
to be drawn when the methodology is applied. The methodologies should 
be molded so that a correct decision can follow an application of the method- 
ology, and the chain of reasoning, in my opinion, best proceeds from a defini- 
tion of the decisions to be made to a justification of the methodology. 

The three examples involving a distinction between the least squares 
solution and solutions offering measures free of bias have suggested a second 
point. I believe that this distinction can be usefully applied in other contexts
and that some added insight will obtain.

Finally, I hope that the foregoing has clarified my most general thesis 
and that I have given some support to the notion that effort expended in 
seeking new problem statements can be profitable. 


Manuscript received 9/4/57 


OPTIMAL TEST LENGTH FOR MULTIPLE PREDICTION:
THE GENERAL CASE*


PAUL HORST AND CHARLOTTE MACEWAN
UNIVERSITY OF WASHINGTON 


The concepts of differential prediction and multiple absolute prediction 
were developed in earlier papers [2, 3]. Methods for determining optimal dis- 
tribution of testing time for each type of prediction are available [4, 5] and 
are appropriate for use provided that no altered time allotment approaches 
zero. In this article the methods developed in [4, 5] are extended to include 
cases where the altered time allotment for one or more tests may approach 
zero. The procedures developed are illustrated by numerical examples, after 
which the mathematical rationales are provided. 


In previous publications [2, 3, 4, 5] the problem of maximum validity 
in predicting multiple criteria was approached in two different ways. In [2] 
and [3], for predicting criteria differentially and for multiple absolute pre- 
diction, respectively, techniques were developed for selecting from a large 
number of potential predictors that subset, of specified size, which yields 
the highest over-all validity as measured by the respective indices of prediction 
efficiency, φ and λ. A more general approach was used in [4] and [5], in which
a procedure previously presented for the case of a single criterion [1] was 
extended to the cases of differential prediction and multiple absolute pre- 
diction, respectively. Here, techniques were presented whereby, starting 
with a given battery of predictors for differential prediction or for multiple 
absolute prediction, one could determine altered administration time allot- 
ments, for any specified over-all testing time, for which the index of prediction 
efficiency (φ or λ, respectively) would be a maximum.

The techniques developed in [4] and [5] provide methods of solving for 
optimal test lengths, in terms of time allotments, by series of approximations. 
Since reciprocals of the altered time allotments are involved, the methods do 
not hold in the event that any altered testing time becomes zero. In this 
article a modification of procedure, applicable also in the case in which 
the new time allotment for any test approaches zero, is presented. 

A numerical example for the case of differential prediction, and a sum- 
mary for the case of multiple absolute prediction follow in the next section. 
The mathematical basis for the procedures described for the general case is 
presented in the final section. 


*This research was carried out under Contract Nonr-477(08) between the University 
of Washington and the Office of Naval Research. The authors express their appreciation 
to Shun Mei Ling for carrying out the computations, and to Elizabeth Cross for typing 
the manuscript for publication. 

Numerical Examples


The General Case for Differential Prediction 


The example below demonstrates a modification of the computational 
procedure presented in [4] such that its applicability is perfectly general. 
The assumptions stated for the more restricted case [1, 4] also apply in the 
general case but will not be repeated here. 

The data used in this example are those used in [4]. The matrix of test 
intercorrelations with reliabilities in the diagonal is shown in Table 1. Criterion 
variables are grade-point averages in each of ten college areas. The matrix 
of validity coefficients is shown in Table 2. 

Over-all testing time for the tests of arbitrary length is 142 minutes. 
Assume, as was the case for the example in [4], that the total testing time is to 
be cut in half, that is, to 71 minutes. The problem is to determine time 
allotments for the various tests such that the resulting index of differential 
prediction efficiency is maximized. The following method of solution employs 
a series of approximations differing from that presented in [4]—with the 
exception of the first iteration, no reciprocals of the altered time allotments 
are involved. 

It will be demonstrated that the results obtained by the original and the 
modified procedures are, for practical purposes, virtually identical. Since we 
start with no near-zero test lengths, the somewhat shorter method, as described
by steps 1 through 8f in [4], may be used to obtain the second approximation
to optimal test lengths. In brief, by these steps we determine: 

1. The a_1 matrix shown in Table 3. This is obtained from Table 2 by
subtracting the mean of column i from each element in column i.

2. The elements in the diagonal matrix, A, shown in row 2 of Table 4. 
Each element is the original test length, given in row 1 of Table 4, multiplied 
by the corresponding unreliability. 

3. The first approximation to the altered test lengths. Assume each 
test length cut in half as shown in row 3 of Table 4. 

4-5. The values shown in row 4 of Table 4. Each element in row 2 of 
Table 4 is divided by the corresponding element in row 3 of Table 4. 

6-7. The matrix L_1. To compute the matrix L_1, first make up a matrix
as follows: Using the R matrix of Table 1, the value of each diagonal element
is increased by adding to it the value of the corresponding element in row 4 of
Table 4. For example, the first diagonal element of the new matrix is .920 +
.160 = 1.080. The L_1 matrix is obtained by premultiplying the matrix a_1
of Table 3 by the inverse of the augmented R matrix. The procedure for
premultiplying a matrix by the inverse of a symmetric matrix is outlined in [6].
The solution is found in two stages, the “forward solution” and the “backward
solution,” both of which may be seen in [4]. In this report only the
backward solution, in Table 5, showing the transpose of the L_1 matrix in
the upper left section is reproduced.

8. The second approximation to the altered test lengths. The compu- 
tational procedure is that used in [4] and is shown in rows a through f in 
Table 5. 

Row a consists of the sums of squares of column elements of the L_1
matrix. For example, the first element in row a, .0626, is the sum of squares 
of the first 10 elements in column 1 of Table 5. 

Row b is copied from row 2 of Table 4. 

Row c consists of the products of corresponding elements in the two 
preceding rows. For example, the first element in row c, .1251, is .0626 × 2.00.
(In the original computations, six decimals were retained in the elements 
of row a.) 

Row d consists of the square roots of the corresponding elements in the 
preceding row. For example, the first element is √.1251 = .3537. The
value of s, as seen to the right of this row, is computed as the over-all new 
testing time, 71 minutes, divided by the sum of elements in row d, 1.8823. 
The quotient is 37.7198. 

Row e gives a check on the computations for row d. Each element in 
row c is divided by the corresponding element in row d. Thus, .1251/.3537 =
.3537.

Row f has as elements the second approximations to optimal test lengths. 
These values are found by multiplying each element in row d by the obtained 
value of s. Thus, for the first element, .3537 × 37.7198 = 13.3415. Summed,
the values in row f should equal 71, the over-all new testing time in minutes. 
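
The arithmetic of rows a through f can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `next_test_lengths` and the small input arrays are hypothetical, and NumPy stands in for the hand worksheet. Only the layout (criteria in rows, tests in columns, as in Table 5) follows the text.

```python
import numpy as np

def next_test_lengths(L_t, delta, total_time):
    """Rows a-f of step 8 (illustrative sketch, not the original worksheet).

    L_t        : transposed L matrix (criteria x tests), laid out as in Table 5
    delta      : row b, one value per test
    total_time : over-all new testing time
    """
    row_a = (L_t ** 2).sum(axis=0)      # sums of squares of column elements
    row_c = row_a * delta               # products of rows a and b
    row_d = np.sqrt(row_c)              # square roots of row c
    s = total_time / row_d.sum()        # scaling constant s
    row_f = row_d * s                   # next approximation to the test lengths
    return row_f, s

# Hypothetical 3-criterion, 2-test input, used only to exercise the function.
L_t = np.array([[0.20, -0.10],
                [-0.10, 0.30],
                [0.10, 0.05]])
lengths, s = next_test_lengths(L_t, np.array([2.0, 0.72]), 71.0)
```

Because row f is row d rescaled by s, its elements sum to the over-all testing time by construction, which is the check mentioned for row f.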

Since there are no near-zero values in row f, normally one would continue 
in the manner described in [4] to obtain the third approximation to altered test 
lengths; i.e., in terms of the present report, substitute the values in row f 
of Table 5 for those in row 3 of Table 4, and repeat steps 4-5 through 8f 
to compute the third approximation to optimal test lengths. 

Assume, on the contrary, that some test length as given in row f of 
Table 5 were near-zero or zero. Under these conditions, it would be difficult 
or impossible, in the succeeding iteration to carry out the computations 
indicated in step 4-5. The modified procedure described below avoids such 
an impasse. This procedure may be employed with complete generality. 
The calculations in Table 5 are completed as follows. 

Row g consists of the square roots of the corresponding elements in the
preceding row. Thus, for the first element, √13.3415 = 3.6526.

Row h gives a check on the calculation of row g. Each element in row f
is divided by the corresponding element in row g. For example, 13.3415/3.6526
= 3.6526.

The elements of row g will be used subsequently for a number of operations.

9. The matrix shown in Table 6 is calculated next. Each column of
Table 6 is obtained by multiplying each element in the corresponding column
of the R matrix shown in Table 1 by the corresponding element in row g.
For example, for the first column of Table 6, the first two elements are:
.920 × 3.6526 = 3.3604; .159 × 3.6526 = .5808.


TABLE 6

The $R\,D_{b_2}^{1/2}$ Matrix

           1         2         3         4         5         6         Σ

 1      3.3604     .3729     .5429     .9962    2.7033    1.9462    9.9219
 2       .5808    2.1575     .0107    1.3082    1.0346     .9183    6.0101
 3       .5552     .0070    3.2860     .7090     .5031    -.5669    4.4934
 4      1.0264     .8653     .7143    2.9071    1.9451    1.6099    9.0681
 5      2.7869     .6848     .5072    1.9463    2.9407    2.3732   11.2391
 6      1.8811     .5699    -.5358    1.5103    2.2250    3.2499    8.9004

 Σ     10.1908    4.6574    4.5253    9.3771   11.3518    9.5306   49.6330
 Ck    10.1908    4.6574    4.5253    9.3771   11.3518    9.5306   49.6330





TABLE 7

The $D_{b_2}^{1/2} R\,D_{b_2}^{1/2}$ Matrix

           1         2         3         4         5         6         Σ        Ck

 1      12.274     1.362     1.983     3.639     9.874     7.109    36.241    36.242
 2       1.362     5.060      .025     3.068     2.426     2.154    14.095    14.094
 3       1.983      .025    11.737     2.532     1.797    -2.025    16.049    16.049
 4       3.639     3.068     2.532    10.306     6.896     5.707    32.148    32.148
 5       9.874     2.426     1.797     6.896    10.419     8.408    39.820    39.820
 6       7.109     2.154    -2.025     5.707     8.408    12.281    33.634    33.635





10. Computed next is the matrix found in Table 7. Each row of Table 7 is
obtained by multiplying each element in the corresponding row of the table
computed in step 9 by the corresponding element in row g. For example,
elements one and two of row 1 of Table 7 are: 3.3604 × 3.6526 = 12.274;
.3729 × 3.6526 = 1.362.
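
Steps 9 and 10 together scale the columns and then the rows of R by the elements of row g. A minimal sketch of that two-sided scaling follows; the function name `scale_symmetric` is hypothetical, and NumPy broadcasting replaces the element-by-element hand multiplication.

```python
import numpy as np

def scale_symmetric(R, g):
    """Return the matrix whose (i, j) element is g[i] * R[i, j] * g[j]."""
    scaled_columns = R * g[np.newaxis, :]       # step 9: column j times g[j]
    return scaled_columns * g[:, np.newaxis]    # step 10: row i times g[i]

# First diagonal element of the worked example: .920 x 3.6526 x 3.6526.
first_diagonal = scale_symmetric(np.array([[0.920]]), np.array([3.6526]))[0, 0]

# A small hypothetical 2 x 2 case showing that symmetry is preserved.
small = scale_symmetric(np.array([[1.0, 0.5], [0.5, 1.0]]), np.array([2.0, 3.0]))
```

The order of the two scalings does not matter, since both are multiplications by the same diagonal matrix.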

11. Calculate a matrix which shall be designated $A_2$. The Δ values
found in row 2 of Table 4 are added to the corresponding diagonal elements
of the table obtained in step 10, and the resulting matrix is copied into the
upper left quadrant of Table 8. The first diagonal element of Table 8 is
2.00 + 12.274 = 14.274. Note that the elements below the diagonal are not
copied in.

12. The diagonal elements in the upper right quadrant of Table 8 are 
the corresponding elements of row g. 

13. Next compute the inverse of the $A_2$ matrix, postmultiplied by the
diagonal matrix in the upper right quadrant of Table 8. The procedure used is
identical with that previously mentioned in connection with computing $L_1$,
and is outlined in ([6], Ch. 21, Sec. 7). Computations for the forward solution
are shown in the lower quadrants of Tables 8 and 9. The backward solution
is shown in Table 10, which gives the transpose of the desired product matrix
in the first six columns.


TABLE 8

Computation of $A_2^{-1} D_{b_2}^{1/2}$, where $A_2 = D_{b_2}^{1/2} R\,D_{b_2}^{1/2} + \Delta$

Forward Solution

                1A      2A      3A      4A      5A      6A  |     1B      2B      3B      4B      5B      6B  |       Σ    Check

        1A  14.274   1.362   1.983   3.639   9.874   7.109  |  3.653                                          |  41.894   41.894
        2A           5.780    .025   3.068   2.426   2.154  |          2.345                                  |  17.160   17.160
        3A                  14.137   2.532   1.797  -2.025  |                  3.572                          |  22.021   22.021
        4A                          14.446   6.896   5.707  |                          3.545                  |  39.833   39.833
        5A                                  12.969   8.408  |                                  3.543          |  45.913   45.913
        6A                                          17.881  |                                          3.779  |  43.013   43.013
        Σ   36.242  14.815  18.449  36.288  42.370  39.234  |  3.653   2.345   3.572   3.545   3.543   3.779  | 209.834

Recip.
 .0701   1  14.274   1.362   1.983   3.639   9.874   7.109  |  3.653                                          |  41.894   41.894
 .1770   2           5.651   -.163   2.722   1.488   1.479  |  -.347   2.345                                  |  13.180   13.175
 .0722   3                  13.857   2.105    .468  -2.971  |  -.518    .068   3.572                          |  16.580   16.581
 .0841   4                          11.886   3.590   3.633  |  -.686  -1.141   -.543   3.545                  |  20.279   20.284
 .2153   5                                   4.645   2.103  | -2.212   -.274    .043  -1.071   3.543          |   6.768    6.777
 .0889   6                                          11.253  |  -.627   -.127    .911   -.600  -1.605   3.779  |  12.969   12.984


TABLE 9

Computation of $A_2^{-1} D_{b_2}^{1/2}$, Continued

                 1       2       3       4       5       6  |      1       2       3       4       5       6  |       Σ    Check

         1   1.000
         2           1.000
         3                   1.000
         4                           1.000
         5                                   1.000
         6                                           1.000
         1  -1.000   -.095   -.139   -.255   -.692   -.498  |  -.256                                          |  -2.935   -2.935
         2          -1.000    .029   -.482   -.263   -.262  |   .061   -.415                                  |  -2.331   -2.332
         3                  -1.000   -.152   -.034    .214  |   .037   -.005   -.258                          |  -1.197   -1.198
         4                          -1.000   -.302   -.306  |   .058    .096    .046   -.298                  |  -1.707   -1.706
         5                                  -1.000   -.453  |   .476    .059   -.009    .231   -.763          |  -1.459   -1.459
         6                                          -1.000  |   .056    .011   -.082    .053    .143   -.336  |  -1.154   -1.154


TABLE 10

Computation of $A_2^{-1} D_{b_2}^{1/2}$: Backward Solution

Matrix $(A_2^{-1} D_{b_2}^{1/2})'$


14. Each column of the matrix obtained in the backward solution now
is multiplied by the corresponding element of row g. For the first element in
column 1, we have .576 × 3.6526 = 2.104. The resulting matrix, with
corresponding off-diagonal elements averaged to make the matrix perfectly
symmetrical, is shown in Table 11.


TABLE 11

Computation of $D_{b_2}^{1/2} A_2^{-1} D_{b_2}^{1/2}$

           1        2        3        4        5        6        Σ

 1      2.104     .057    -.172     .339   -1.599    -.212     .517
 2       .057    1.100     .056    -.270    -.192    -.042     .709
 3      -.172     .056    1.022    -.220    -.098     .306     .894
 4       .339    -.270    -.220    1.337    -.732    -.201     .253
 5     -1.599    -.192    -.098    -.732    2.934    -.540    -.227
 6      -.212    -.042     .306    -.201    -.540    1.270     .581
 Σ       .517     .709     .894     .253    -.227     .581    2.727

TABLE 12

Computation of $L_2 = [D_{b_2}^{1/2} A_2^{-1} D_{b_2}^{1/2}]\,a_c$

           1        2        3        4        5        6

  1      .051    -.053     .073     .051    -.064     .000
  2     -.082     .036     .018     .018     .044     .022
  3      .003    -.005    -.003    -.009     .024    -.049
  4      .131    -.003    -.076    -.062     .054     .052
  5      .070     .093    -.108    -.009    -.229     .101
  6     -.175    -.043     .100    -.058     .080    -.055
  7      .001    -.041    -.008    -.092     .120    -.052
  8     -.065     .104    -.085     .086    -.026     .003
  9      .076    -.069     .075     .005     .006     .010
 10     -.005    -.022     .012     .073    -.014    -.028
  Σ      .005    -.003    -.002     .003    -.005     .004
 Ck      .005    -.002    -.004     .002    -.004     .003





15. Next compute, by successive columns, the matrix $L_2'$, which is
shown in the first ten rows of Table 12. The ith element in the first column
of $L_2'$ is the product sum of elements in the first row of Table 11 by the
corresponding elements in the ith row of the $a_c'$ matrix in Table 3. For example,
the first element in the first column of $L_2'$ is (2.104)(.023) + (.057)(−.047) +
(−.172)(.089) + (.339)(.033) + (−1.599)(−.004) + (−.212)(−.016) =
.051. The second column of $L_2'$ is obtained in the same manner as the first
except that the second row of Table 11 is used instead of the first. To obtain
the third column of $L_2'$ the third row of Table 11 is used, and so on until the
table is completed.
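
The product sums of step 15 amount to a single matrix product. The sketch below reproduces the first row of Table 12 from the matrix of Table 11 and the first row of $a_c'$ quoted in the worked example. The variable names are hypothetical, and the Table 11 entries are entered as read from the table to three decimals, so agreement is expected only to about the third decimal.

```python
import numpy as np

# Symmetric matrix of Table 11, entered to three decimals.
table11 = np.array([
    [ 2.104,  0.057, -0.172,  0.339, -1.599, -0.212],
    [ 0.057,  1.100,  0.056, -0.270, -0.192, -0.042],
    [-0.172,  0.056,  1.022, -0.220, -0.098,  0.306],
    [ 0.339, -0.270, -0.220,  1.337, -0.732, -0.201],
    [-1.599, -0.192, -0.098, -0.732,  2.934, -0.540],
    [-0.212, -0.042,  0.306, -0.201, -0.540,  1.270],
])

# First row of the a_c' matrix, as quoted in the example of step 15.
a_c_t_row1 = np.array([[0.023, -0.047, 0.089, 0.033, -0.004, -0.016]])

def l_transposed(a_c_t, m):
    """Element (i, j) is the product sum of row i of a_c' with row j of m."""
    return a_c_t @ m.T

first_row = l_transposed(a_c_t_row1, table11)[0]
```

With the full ten-row $a_c'$ matrix of Table 3 in place of the single row used here, the same call would return all ten rows at once.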

16. Step 8 now is repeated, rows a through f, using the $L_2'$ matrix to obtain
a third approximation to the altered test lengths (i.e., a new row f). These
computations are not reproduced here, but the values obtained in row f may
be seen in the third row of Table 13.

TABLE 13

Differential Prediction: Successive Approximations* to $1'D_b$ for $T_b = \tfrac{1}{2}T_a = 71$

                                                                                  Value of φ for
                                                                                  Successive
     Approx'n          1       2       3       4       5       6        Σ        Values of L

(0.5)1'D_a:  1     12.50    4.50   15.00   11.50    7.50   20.00    71.00      L_1     .227
             2     13.34    5.50   12.76   12.57   12.55   14.28    71.00      L_2     .234
             3     13.20    5.31   11.57   12.56   16.06   12.32    71.02      L_3     .236
             4     13.20    5.18   11.04   12.51   17.63   11.44    71.00      L_4     .237
             5     13.26    5.12   10.82   12.56   18.15   11.09    71.00      L_5     .235
             6     13.32    5.12   10.75   12.55   18.37   10.91    71.01      L_6     .238

*The third and subsequent approximations were computed by the procedure described
for the general case.


TABLE 14

Differential Prediction: Successive Approximations to $1'D_b$ for $T_b = \tfrac{1}{2}T_a = 71$

From (4, p. 60)

                                                                                  Value of φ for
                                                                                  Successive
     Approx'n          1       2       3       4       5       6        Σ        Values of L

(0.5)1'D_a:  1     12.50    4.50   15.00   11.50    7.50   20.00    71.00      L_1     .227
             2     13.34    5.50   12.76   12.57   12.55   14.28    71.00      L_2     .234
             3     13.27    5.32   11.55   12.52   16.05   12.29    71.00      L_3     .235
             4     13.23    5.20   10.98   12.47   17.62   11.51    71.01      L_4     .236
             5     13.31    5.15   10.76   12.46   18.13   11.19    71.00      L_5     .237
             6     13.35    5.12   10.70   12.46   18.37   11.00    71.00      L_6     .236





Computations for the fourth approximation may be summarized as
follows: (1) a new row g is computed to obtain the square roots of the
corresponding values in the new row f; (2) steps 9 through 11 are repeated with
the new values obtained in row g to compute the matrix $A_3$; (3) steps 12
through 15 are repeated to compute $L_3'$; (4) step 16 is repeated to obtain the
fourth approximation. Thus, given any approximation, row g of step 8 and
steps 9 through 16 designate the procedure which may be used with complete
generality to compute subsequent approximations to optimal test lengths
for differential prediction.

In all, five approximations beyond the first were computed and are
summarized in Table 13. Of these, the second was computed by the procedure
presented in [4]; approximations three through six were calculated by the
procedure described for the general case.

17. Successive indices of differential prediction efficiency, φ, are computed
as follows:

(a) To obtain $\phi_1$, corresponding to the first approximation to the altered
test lengths, each element in the $L_1'$ matrix is multiplied by the corresponding
element in the $a_c'$ matrix in Table 3, and all products are summed. The
resulting value, .227, is found as the first entry in the φ column at the extreme
right in Table 13.

(b) The value of $\phi_2$ is obtained in the same manner except that the $L_2'$
matrix is used.

(c) Subsequent values, $\phi_i$, are obtained by using the elements of $L_i'$ and
the corresponding elements of $a_c'$ in Table 3.
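
The index described in step 17 is a sum of element-by-element products, which equals the trace of the product of one matrix with the transpose of the other. A minimal sketch, with a hypothetical function name and hypothetical 2 × 2 inputs chosen only so that the result can be checked by hand:

```python
import numpy as np

def prediction_index(L_t, a_c_t):
    """Sum of products of corresponding elements of L' and a_c' (step 17)."""
    return float((L_t * a_c_t).sum())

# Hypothetical matrices, not taken from Tables 3, 5, or 12.
L_t = np.array([[0.2, -0.1],
                [0.0, 0.3]])
a_c_t = np.array([[0.5, 0.1],
                  [-0.2, 0.4]])

phi = prediction_index(L_t, a_c_t)            # .10 - .01 + .00 + .12
phi_as_trace = float(np.trace(L_t.T @ a_c_t)) # the same quantity as a trace
```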

Table 14 shows the results obtained in [4] by the original procedure, for 
the same data and with the over-all new testing time also equal to one-half 
the original time. Comparison of the corresponding values in Tables 13 and 
14 indicates results essentially the same for all practical purposes. The largest 
discrepancy does not exceed one-tenth of a minute, and the increases in φ, 
though small, are comparable within the range of rounding errors. In neither 
case have computations been carried to the point of complete stabilization of 
the vector of time allotments. Results, however, appear adequate for practical 
purposes. 

The question may arise as to the stability of the time estimates from 
sample to sample. The entire problem of significance tests, however, has not 
yet been touched. 


The General Case for Multiple Absolute Prediction 


The computational procedure presented in [5] for obtaining optimal test 
length for multiple absolute prediction consists of the same sequence of 
operations as that given in [4], the difference being that in [5] the matrix of 
validity coefficients is used, whereas in [4] these coefficients in deviation 
form for each test were required. 

Similarly, the sequence of operations for the general case for multiple
absolute prediction is the same as that presented above, the difference being
that the matrix $r_c$ is used, whereas $a_c$ was required above. Instead of
presenting a numerical example in detail for the general case for multiple absolute
prediction, here only the procedural steps which differ from those described
above will be indicated. Namely:

Step 1 is omitted.

In the step designated 6-7, the $r_c$ matrix is used instead of matrix $a_c$.

In steps 15 and 17, the $r_c'$ matrix is used instead of matrix $a_c'$. The above
distinctions assume, as was assumed for the general case for differential
prediction, that the second approximation to altered time allotments was
computed by the original method.

The series of approximations to optimal test lengths for maximum 
absolute prediction, shown in Table 15, was obtained with the same original 
data as the series in the previous example, but with the over-all testing time 
taken as unchanged. Hence the original test lengths were taken as the first 
approximation. 

To demonstrate that the procedure developed for the general case yields
the same results as the procedure presented in [5], the procedure described for
the general case was employed immediately. The square roots of the original
test lengths were found at once, as designated by step 8, row g of the preceding
example. Further procedural steps followed the directions given in the steps
subsequent to step 8, row g, with the exception, of course, that in steps 15
and 17, the $r_c'$ matrix was used instead of matrix $a_c'$.


TABLE 15

Absolute Prediction: Successive Approximations* to $1'D_b$ for $T_b = T_a = 142$

                                                                                  Value of λ for
                                                                                  Successive
     Approx'n          1       2       3       4       5       6        Σ        Values of L

(1.0)1'D_a:  1     25.00    9.00   30.00   23.00   15.00   40.00   142.00      L_1    2.203
             2     32.53   10.02   10.50   21.40   18.65   48.89   141.99      L_2    2.229
             3     32.82    9.67    8.30   20.31   21.60   49.30   142.00      L_3    2.230
             4     32.79    9.53    7.64   19.96   23.03   49.05   142.00      L_4    2.230

*The second, third and fourth approximations were computed by the procedure
described for the general case.


TABLE 16

Absolute Prediction: Successive Approximations to $1'D_b$ for $T_b = T_a = 142$

From (5, p. 120)

                                                                                  Value of λ for
                                                                                  Successive
     Approx'n          1       2       3       4       5       6        Σ        Values of L

(1.0)1'D_a:  1     25.00    9.00   30.00   23.00   15.00   40.00   142.00      L_1    2.203
             2     32.54   10.00   10.45   21.42   18.66   48.93   142.00      L_2    2.230
             3     32.87    9.70    8.21   20.21   21.61   49.40   142.00      L_3    2.234
             4     32.76    9.52    7.57   19.99   23.08   49.08   142.00      L_4    2.232


Three approximations beyond the original values were computed. These,
with the corresponding values of the index of multiple absolute prediction
efficiency, λ, are shown in Table 15. The corresponding results obtained by
the original method are found in Table 16. A comparison of the two tables
shows no discrepancy in the entire series greater than one-tenth of a minute,
and no difference between the corresponding values of λ beyond those of
rounding errors.


Mathematical Derivation 


The General Case for Differential Prediction 


The mathematical rationale presented in [4] provides a solution for 
obtaining optimal test lengths by means of a series of approximations. The 
formulas derived are not applicable, however, in the event that the altered 
time allotment for any test approaches zero. The derivation which follows 
consists in developing, from the computational equations presented in [4], 
formulas which do not involve reciprocals of the altered time allotments, 
and which, consequently, provide a solution such that its applicability is 
perfectly general. Using the notation of [4], let

$n$ = the number of predictors,

$N$ = the number of criteria,

$r$ = the (n × n) matrix of intercorrelations of tests of original lengths,

$r_c$ = the (n × N) matrix of validity coefficients for the tests of original
lengths,

$D_a$ = the (n × n) diagonal matrix of original test lengths,

$D_b$ = the (n × n) diagonal matrix of altered test lengths,

$D_{r_{tt}}$ = the (n × n) diagonal matrix of reliability coefficients for the tests
of original lengths.

As in [4], define

$R = r - (I - D_{r_{tt}}),$

$a_c = r_c\,(I - \tfrac{1}{N}\,1\,1'),$

$\Delta = D_a\,(I - D_{r_{tt}}),$

and again state the constraining condition, $T = 1'D_b 1$, where 1 is a vector of
unities.

Start with equations (43) and (44) of [4], namely, the equations from
which the formulas for the iterative solution for $D_b$ were derived. These are,
respectively,

(1) $D_b = \dfrac{(D_{LL'}\,\Delta)^{1/2}\;T}{1'\,(D_{LL'}\,\Delta)^{1/2}\,1}$

and

(2) $L = (R + \Delta D_b^{-1})^{-1} a_c ,$

where $D_{LL'}$ is a diagonal matrix whose non-zero elements are the diagonal
elements of $LL'$. Equation (2) may also be expressed as

(3) $L = (D_b^{-1/2} D_b^{1/2} R\, D_b^{1/2} D_b^{-1/2} + D_b^{-1/2} \Delta\, D_b^{-1/2})^{-1} a_c ,$

or equivalently, as

(4) $L = (D_b^{-1/2}\,(D_b^{1/2} R\, D_b^{1/2} + \Delta)\, D_b^{-1/2})^{-1} a_c ,$

or finally, as

(5) $L = D_b^{1/2}\,(D_b^{1/2} R\, D_b^{1/2} + \Delta)^{-1} D_b^{1/2}\, a_c ,$

an equation which involves no negative powers of $D_b$.

Let

(6) $L_i = D_{b_i}^{1/2}\,(D_{b_i}^{1/2} R\, D_{b_i}^{1/2} + \Delta)^{-1} D_{b_i}^{1/2}\, a_c ,$

where

(7) $D_{b_1} = \dfrac{T}{1'D_a 1}\, D_a ,$

and

(8) $D_{b_{i+1}} = \dfrac{(D_{L_i L_i'}\,\Delta)^{1/2}\;T}{1'\,(D_{L_i L_i'}\,\Delta)^{1/2}\,1} .$

The first approximation to $D_b$ is indicated by (7). The second and all
subsequent approximations to $D_b$ may be obtained by an iterative procedure based
on (6) and (8). In this manner, successive approximations to $L_i$ and $D_{b_{i+1}}$
may be computed until $D_b$ stabilizes satisfactorily.
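
The iteration defined by (6) and (8) can be sketched as follows. This is a hedged illustration, not the hand procedure of the worked example: the function name `iterate_lengths` and all numerical inputs are hypothetical, and `numpy.linalg.solve` is substituted for the forward and backward solutions. One starting length is set to zero to show that no reciprocal of a test length is ever required.

```python
import numpy as np

def iterate_lengths(R, delta, a_c, d_start, total_time, n_iter=5):
    """Iterate equations (6) and (8); delta is passed as a 1-D array."""
    d = np.asarray(d_start, dtype=float)
    for _ in range(n_iter):
        g = np.sqrt(d)                                          # D_b^(1/2)
        A = g[:, None] * R * g[None, :] + np.diag(delta)        # scaled R plus Delta
        L = g[:, None] * np.linalg.solve(A, g[:, None] * a_c)   # equation (6)
        w = np.sqrt((L ** 2).sum(axis=1) * delta)               # (D_LL' Delta)^(1/2)
        d = total_time * w / w.sum()                            # equation (8)
    return d, L

# Hypothetical data: 3 tests, 2 criteria, second starting length equal to zero.
R = np.array([[0.90, 0.30, 0.20],
              [0.30, 0.80, 0.10],
              [0.20, 0.10, 0.85]])
delta = np.array([2.0, 1.5, 3.0])
a_c = np.array([[0.10, -0.10],
                [-0.05, 0.05],
                [0.08, -0.08]])
d, L = iterate_lengths(R, delta, a_c, [30.0, 0.0, 30.0], 60.0)

# For strictly positive lengths, forms (2) and (5) give the same L.
d_pos = np.array([20.0, 10.0, 30.0])
g = np.sqrt(d_pos)
L_form5 = g[:, None] * np.linalg.solve(
    g[:, None] * R * g[None, :] + np.diag(delta), g[:, None] * a_c)
L_form2 = np.linalg.solve(R + np.diag(delta / d_pos), a_c)
```

In this sketch a test whose length is exactly zero yields a zero row in L and so remains at zero in later iterations; the computation nevertheless proceeds without any division by a test length.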


The General Case for Multiple Absolute Prediction 


By an analogous development, it can be shown that for the general case
for multiple absolute prediction the formula for successive approximations
to L is

(9) $L_i = D_{b_i}^{1/2}\,(D_{b_i}^{1/2} R\, D_{b_i}^{1/2} + \Delta)^{-1} D_{b_i}^{1/2}\, r_c ,$

and that the formulas for obtaining the first and subsequent approximations
to $D_b$ are identical with (7) and (8) above.


STIMULUS AND RESPONSE GENERALIZATION: A STOCHASTIC
MODEL RELATING GENERALIZATION TO DISTANCE IN
PSYCHOLOGICAL SPACE*


ROGER N. SHEPARD


NAVAL RESEARCH LABORATORY


A mathematical model is developed in an attempt to relate errors in 
multiple stimulus-response situations to psychological interstimulus and
interresponse distances. The fundamental assumptions are (a) that the
stimulus and response confusions go on independently of each other, (b) that 
the probability of a stimulus confusion is an exponential decay function of 
the psychological distance between the stimuli, and (c) that the probability 
of a response confusion is an exponential decay function of the psychological 
distance between the responses. The problem of the operational definition of 
psychological distance is considered in some detail. 


Stochastic models for learning have been developed by Estes [8], by 
Bush and Mosteller [6], and others. With the exception of a few investigations 
confined to the stimulus side of the learning process, such as that by Bush 
and Mosteller [5], however, these models have not been extensively applied 
to generalization phenomena. This paper, in using the notion of psychological 
distance, approaches the generalization problem from a somewhat different 
direction.

Consideration will be restricted to situations in which a number of
responses are discriminatively attached to a number of stimuli by consistent
application of differential reinforcement. More precisely, the learning process
will be supposed to conform to the following rules: (a) On any given trial a
single stimulus is presented at random from a set of N stimuli. (b) On each
trial the subject is constrained to a fixed set of N responses. (c) For any
given subject, there is a prevailing one-to-one assignment of the N responses
to the N stimuli arbitrarily determined in advance such that a certain
reinforcing operation (e.g., the word "correct") is applied if and only if the
presentation of a stimulus is followed by the occurrence of its assigned response.
The present model, however, is concerned not with the learning process
per se but with the pattern of generalizations exhibited at any one given
stage of learning. Furthermore, as an approximation for any given set of
stimuli or responses, all subjects are assumed to generalize according to the
same pattern.

*This paper is based upon the theoretical sections of a Ph.D. dissertation submitted
to the Graduate School of Yale University and upon subsequent modification carried out
on a National Academy of Sciences-National Research Council Postdoctoral Associateship
at the Naval Research Laboratory. The author is particularly indebted to Drs. C.
Hovland, R. P. Abelson, and B. S. Rosner for their generous advice and support. Helpful
criticisms have also been contributed by Drs. G. A. Miller, F. A. Logan, W. D. Garvey,
J. G. Holland, and H. Glaser.
†Now at Psychological Laboratories, Harvard University.

Ordinarily there is no necessary or natural correspondence between
the stimuli and responses and, indeed, different assignments may be set up
for different subjects. It is convenient, therefore, to have a way of referring
to the response which has been assigned, for a given subject m, to a given
stimulus, $S_i$, without having to ask just which one of the N responses that
may be. Accordingly, the following definitions are introduced:

$S_i$ = the ith of the N stimuli $S_1, S_2, \ldots, S_N$;

$R_i$ = the ith of the N responses $R_1, R_2, \ldots, R_N$;

$R_{(i),m}$ = the response assigned to $S_i$ for subject m;

$S_{(i),m}$ = the stimulus to which $R_i$ is assigned for subject m.

Thus the set of all stimulus-response sequences, for any subject m, divides
into (a) the subset of reinforced sequences of the form $S_i \to R_{(i),m}$ and (b)
the subset of nonreinforced sequences of the form $S_i \to R_{(k),m}$ with $i \neq k$.

At any given stage of learning there will be, for every stimulus and
every response, a probability that the one will be followed by the other.
These probabilities are designated as follows:

$P_{ik,m} = P[R_k \mid S_i]_m$ = the conditional probability of $R_k$, given $S_i$,
for subject m.

$P_{i(k),m}$, $P_{(i)k,m}$, and $P_{(i)(k),m}$ are defined in an analogous manner. Thus

$P_{i(k),m} = P[R_{(k),m} \mid S_i]_m$ = the conditional probability of $R_{(k),m}$,
given $S_i$, for subject m.

The responses are partitioned so as to be mutually exclusive and
exhaustive. The conditional probabilities, therefore, satisfy the requirements

(1) $\sum_k P_{i(k),m} = 1, \qquad P_{i(k),m} \geq 0.$

If the probabilities $P_{i(i),m}$ increase with continued application of the
reinforcing operation, there must result a decrease in some $P_{i(k),m}$ with
$i \neq k$. Although it is known that the probabilities of the various incorrect
responses, called generalization errors, do not in general decay at the same
rate, little advance has been made towards the quantitative understanding
of this aspect of the learning process. About all one has to go on is the
qualitative observation that, at a given stage of learning, the probability,
$P_{i(k),m}$, decreases both with the dissimilarity of $S_i$ and $S_k$, and with the
dissimilarity of $R_{(i),m}$ and $R_{(k),m}$.


The Reduction of the S-R Process to an S-S and R-R Process 


An error in which the response assigned to a stimulus, $S_j$, follows the
presentation of another, $S_i$, (as in B of Fig. 1) may be viewed as comprising
two events as illustrated in C. It may be that the subject confused two
stimuli: when $S_i$ was presented, it was taken to be $S_k$. Now, if $S_k$ is taken
to be the stimulus, the response which should ensue is $R_{(k),m}$. Suppose,
however, that the subject also confused two responses; whereas the tendency
was to make $R_{(k),m}$, the response actually made (according to the external
criteria) was $R_{(j),m}$. In this way an S-R transition may be analyzed into an
S-S transition and an R-R transition. Alternatively, there may be a stimulus
confusion without any response confusion (D), a response confusion without
any stimulus confusion (E), or both stimulus and response confusions which
so counteract each other that the correct response is made (F).

FIGURE 1

Different ways of conceptualizing an S-R sequence as generated by an S-S and an R-R
sequence. The circles on the left stand for the different stimuli and the circles on the right
for their assigned responses. In A, for example, $S_i$ was presented and followed by its assigned
response, $R_{(i),m}$. In B, however, $S_i$ was followed by an incorrect response, $R_{(j),m}$.

This analysis suggests the following additional definitions:

$P^S_{ik} = P[S'_k \mid S_i]$ = the conditional probability that $S_i$, when
presented, will be taken to be $S_k$.

$P^R_{ik} = P[R_k \mid R'_i]$ = the conditional probability that $R_k$ will be made
in place of $R_i$.

The term introduced in the second definition is also the probability that,
when the stimulus is taken to be $S_{(i),m}$, $R_k$ will follow, since the model is set
up so that

$P[R'_i \mid S'_{(j),m}] = 1$ if $i = j$; 0, otherwise.

Thus, the conditional probabilities are treated as if the subject (a) knows
the connection between any stimulus and its assigned response but (b) still
makes errors owing to a certain inability to identify the stimulus or reproduce
the response with sufficient accuracy (see Fig. 1). The analysis, then, is
applied to the confusions among the stimuli and among the responses but
not to the associations between the stimuli and the responses.

Strictly speaking, the S-S and R-R transition probabilities pertain to
a short interval of time during the learning process so that a stable state
may be assumed to exist. Over extended periods the probabilities will change
owing to the effects of reinforcement. At any given time, however, they
must satisfy the conditions

(2) $\sum_k P^S_{ik} = 1, \qquad P^S_{ik} \geq 0,$

(3) $\sum_k P^R_{ik} = 1, \qquad P^R_{ik} \geq 0.$

The fundamental assumption which will be made here is that, at each
stage of learning, the response confusions are independent of the stimulus
confusions, or, more explicitly, that $P^R_{(k)(j),m}$ does not depend upon which
stimulus was presented (and taken to be $S_k$) on the trial considered. This
assumption implies that

(4) $P_{i(j),m} = \sum_k P^S_{ik}\,P^R_{(k)(j),m} .$

Since the indices i, j, and k are allowed to range over the values
1, 2, ..., N, (4) corresponds to the defining equation for matrix multiplication,
so that

(5) $P_{S(R),m} = P_{SS} \cdot P_{(R)(R),m} ,$

where, for example,

(6) $P_{S(R),m} = \begin{bmatrix} P_{1(1),m} & P_{1(2),m} & \cdots & P_{1(N),m} \\ P_{2(1),m} & P_{2(2),m} & \cdots & P_{2(N),m} \\ \vdots & \vdots & & \vdots \\ P_{N(1),m} & P_{N(2),m} & \cdots & P_{N(N),m} \end{bmatrix} .$

(Dr. Burton S. Rosner has independently proposed essentially the same
treatment for the S-R process.)
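
The step from the sum in (4) to the matrix product in (5) can be checked numerically. The two 3 × 3 matrices below are hypothetical confusion matrices whose rows sum to one; the variable names are illustrative only.

```python
import numpy as np

# Hypothetical S-S confusion probabilities (rows sum to one).
P_SS = np.array([[0.80, 0.15, 0.05],
                 [0.10, 0.80, 0.10],
                 [0.05, 0.15, 0.80]])

# Hypothetical R-R confusion probabilities, indexed by assigned responses.
P_RR_assigned = np.array([[0.90, 0.10, 0.00],
                          [0.05, 0.90, 0.05],
                          [0.00, 0.10, 0.90]])

N = 3
# Equation (4), written out term by term.
by_sum = np.array([[sum(P_SS[i, k] * P_RR_assigned[k, j] for k in range(N))
                    for j in range(N)] for i in range(N)])
# Equation (5), as a single matrix product.
by_matmul = P_SS @ P_RR_assigned
```

Each row of the product again sums to one, as requirement (1) demands of the S-R probabilities.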

As the form of the definitions suggests, there is an S-R symmetry in
the notation such that the content of (4) can be set down in an alternative
form giving the probability, when $S_{(i),m}$ is presented to subject m, that $R_j$
will ensue.

(7) $P_{(i)j,m} = \sum_k P^S_{(i)(k),m}\,P^R_{kj} .$

The matrix representation becomes

(8) $P_{(S)R,m} = P_{(S)(S),m} \cdot P_{RR} .$

Equivalent equations (5) and (8) can be reduced to a single form without
parentheses around the subscripts. In order to do this, it is convenient to
introduce permutation matrices, $J_m$, with elements $J_{ik,m}$, such that

(9) $J_{ik,m} = 1$, if $R_k$ is assigned to $S_i$ for subject m; 0, otherwise.

By carrying out the indicated multiplications, the following identities may
be verified.

(10)
$P_{xS,m} = P_{x(S),m} \cdot J_m^T ,$
$P_{xR,m} = P_{x(R),m} \cdot J_m ,$
$P_{Sx,m} = J_m \cdot P_{(S)x,m} ,$
$P_{Rx,m} = J_m^T \cdot P_{(R)x,m} ,$
$J_m \cdot J_m^T = J_m^T \cdot J_m = I .$

Here, the subscript x stands for any of the symbols S, (S), R, or (R); $J_m^T$
is the transpose of $J_m$; and I is the identity matrix.
Using these relations, (5) and (8) can be brought into the form

(11) $P_{SR,m} = P_{SS} \cdot J_m \cdot P_{RR} ,$

where $P_{SR,m}$ is the matrix of S-R transition probabilities $P_{ij,m}$.
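
A small numerical sketch of the permutation matrix of (9) and the product form (11) follows. The assignment, the two confusion matrices, and the helper name `assignment_matrix` are all hypothetical; indices are 0-based in the code, so response 2 in the code corresponds to the third response.

```python
import numpy as np

def assignment_matrix(assigned):
    """J[i, k] = 1 if response k is assigned to stimulus i, else 0."""
    n = len(assigned)
    J = np.zeros((n, n))
    J[np.arange(n), assigned] = 1.0
    return J

# Hypothetical assignment for one subject: S1 -> R3, S2 -> R1, S3 -> R2.
J = assignment_matrix([2, 0, 1])

P_SS = np.array([[0.80, 0.15, 0.05],
                 [0.10, 0.80, 0.10],
                 [0.05, 0.15, 0.80]])
P_RR = np.array([[0.90, 0.10, 0.00],
                 [0.05, 0.90, 0.05],
                 [0.00, 0.10, 0.90]])

P_SR = P_SS @ J @ P_RR                    # equation (11)
P_S_assigned = P_SS @ (J @ P_RR @ J.T)    # equation (5), R-R matrix re-indexed
no_confusion = np.eye(3) @ J @ np.eye(3)  # with no confusions, only J remains
```

Re-indexing the columns of `P_S_assigned` by the assignment (postmultiplying by J) returns `P_SR`, which is the content of the second identity in (10).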

Now, although $J_m$ is known and $P_{SR,m}$ can be estimated from the
experimental data, neither $P_{SS}$ nor $P_{RR}$ can be directly determined. It might be
thought, however, that by assigning the responses to the stimuli in a different
way for each subject, the influence of the response confusions could be
counterbalanced out of (11) so that $P_{SS}$ could be solved for in terms of $P_{SR}$.
Suppose, then, that there are M subjects, each with a different assignment,
so that, over the set of all M assignments, every pair of responses is assigned
to each pair of stimuli the same number of times in both of the two possible
orders.

Using the simplified notations

(12) $P_{S(R)} = \dfrac{1}{M} \sum_{m=1}^{M} P_{S(R),m} ,$

(13) $P_{(R)(R)} = \dfrac{1}{M} \sum_{m=1}^{M} P_{(R)(R),m} ,$

and assuming that all the subjects are essentially alike so that $P_{SS}$ is
independent of m, equation (5) may be averaged over all M subjects to yield

$P_{S(R)} = P_{SS} \cdot P_{(R)(R)} .$

Postmultiplying through by $P_{(R)(R)}^{-1}$,

(14) $P_{SS} = P_{S(R)} \cdot P_{(R)(R)}^{-1} .$

The assumption that all subjects tend to confuse stimuli in accordance 
with the same pattern is analogous to the similar assumption made in order 
to pool data from different subjects in psychological scaling procedures. 
This assumption is probably correct only as a first approximation since the 
tendency to confuse any particular pair of stimuli probably depends to some 
degree upon the history of discrimination learning associated with that pair. 

In order to evaluate the inverse R-R matrix, it may be noted from (10)
that

(15) $P_{(R)(R),m} = J_m \cdot P_{RR} \cdot J_m^T ,$

where the assignments are so chosen that the matrices $J_m$ and $J_m^T$ select, for
each nondiagonal cell of $P_{(R)(R)}$, elements $P^R_{ik}$ ($i \neq k$) from every
nondiagonal cell of $P_{RR}$ an equal number of times, as m ranges from 1 to M.
Likewise, for each diagonal cell of $P_{(R)(R)}$, elements $P^R_{ii}$ will be equally
selected from each diagonal cell of $P_{RR}$. Thus, by definition of the matrix
elements $J_{ig,m}$ and $J^T_{hk,m}$, averaging over all assignments insures that

(16)
$\dfrac{1}{M} \sum_m P^R_{(i)(k),m} = \dfrac{1}{M} \sum_m \sum_g \sum_h J_{ig,m}\,P^R_{gh}\,J^T_{hk,m} = \dfrac{1}{N(N-1)} \sum_g \sum_h P^R_{gh} = P^R \qquad (i \neq k;\ g \neq h),$

$\dfrac{1}{M} \sum_m P^R_{(i)(i),m} = \dfrac{1}{M} \sum_m \sum_g J_{ig,m}\,P^R_{gg}\,J^T_{gi,m} = \dfrac{1}{N} \sum_g P^R_{gg} = Q^R ,$

where $P^R$ is the mean probability that two different responses are confused
and $Q^R$ is the mean probability that a response is not confused with any
other. Since the total probability must be conserved,

(17) $Q^R + (N - 1) \cdot P^R = 1 .$

One way of understanding the operations represented in (16) is to
multiply, e.g., an arbitrary 3 × 3 matrix (for $P_{RR}$) by all six possible 3 × 3
permutation matrices and their transposes as indicated in (15). The sum of
these products will contain equal nondiagonal elements, $P^R$, and equal
diagonal elements, $Q^R$, as required. A minimum set of permutation matrices
having the necessary properties for N = 9 is given in the appendix.
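
The 3 × 3 illustration just described is easy to carry out numerically. In the sketch below the R-R matrix is an arbitrary hypothetical one with rows summing to one, and the products are averaged (summed and divided by six) so that the result can be compared directly with $P^R$ and $Q^R$.

```python
import itertools
import numpy as np

# An arbitrary (hypothetical) 3 x 3 R-R matrix whose rows sum to one.
P_RR = np.array([[0.70, 0.20, 0.10],
                 [0.05, 0.90, 0.05],
                 [0.30, 0.10, 0.60]])

total = np.zeros((3, 3))
perms = list(itertools.permutations(range(3)))
for perm in perms:
    J = np.eye(3)[list(perm)]     # one of the six permutation matrices
    total += J @ P_RR @ J.T       # the form of equation (15)
mean = total / len(perms)

Q_R = np.trace(P_RR) / 3                        # mean diagonal element
P_R = (P_RR.sum() - np.trace(P_RR)) / (3 * 2)   # mean nondiagonal element
off_diagonal = mean[~np.eye(3, dtype=bool)]
```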

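Equation (14) may thus be written in the form

(18) $P_{SS} = P_{S(R)} \cdot \begin{bmatrix} Q^R & P^R & \cdots & P^R \\ P^R & Q^R & \cdots & P^R \\ \vdots & \vdots & & \vdots \\ P^R & P^R & \cdots & Q^R \end{bmatrix}^{-1} .$

But the inverse matrix has a simple representation such that

(19) $P_{SS} = P_{S(R)} \cdot \dfrac{1}{1 - NP^R} \begin{bmatrix} (1 - P^R) & -P^R & \cdots & -P^R \\ -P^R & (1 - P^R) & \cdots & -P^R \\ \vdots & \vdots & & \vdots \\ -P^R & -P^R & \cdots & (1 - P^R) \end{bmatrix} ,$

as may be verified by using (17) and showing that the product of the original
matrix (with elements $P^R$ and $Q^R$) and its inverse representation in (19)
yields the identity matrix.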
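
The verification mentioned above can also be done numerically. The following sketch builds the matrix of (18) and the proposed inverse of (19) for hypothetical values of N and $P^R$ (any values with $NP^R \neq 1$ would serve) and multiplies them; the function names are illustrative.

```python
import numpy as np

def confusion_matrix(N, P):
    """Matrix of (18): Q = 1 - (N - 1) P on the diagonal, P elsewhere."""
    Q = 1.0 - (N - 1) * P
    return (Q - P) * np.eye(N) + P * np.ones((N, N))

def inverse_from_19(N, P):
    """Matrix of (19): (1 - P) on the diagonal, -P elsewhere, over (1 - N P)."""
    return (np.eye(N) - P * np.ones((N, N))) / (1.0 - N * P)

N, P = 9, 0.04     # hypothetical values
product = confusion_matrix(N, P) @ inverse_from_19(N, P)
```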
Expanding with respect to the general term in (19), the probability
that $S_i$ will be taken to be $S_k$ is

$P^S_{ik} = \dfrac{(1 - P^R)\,P_{i(k)} - P^R \sum_{j \neq k} P_{i(j)}}{1 - NP^R} .$

Using (1) and (17), this may be reduced to

(20) $P^S_{ik} = \dfrac{P_{i(k)} - P^R}{1 - NP^R} .$

Thus, although the response confusions are not entirely eliminated by
employing different S-R assignments, they are consolidated in the single
parameter $P^R$. In principle this parameter could be empirically estimated in
special cases (e.g., with unidimensional stimuli) by extrapolating a fitted
S-R transition probability function (for pairs of stimuli) in the direction of
increasing stimulus dissimilarity. For, by (20),

as $P^S_{ik} \to 0$, $P_{i(k)} \to P^R$.

In practice, if the responses are highly distinctive so that $P^R$ is close to zero,
the probabilities $P_{i(k)}$ can be taken as estimates of the probabilities $P^S_{ik}$
with less R-R probability contamination than would be possible without the
counterbalancing technique. The reason for this will appear in the discussion
of estimation procedures.
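
Equation (20) can be exercised on synthetic numbers. In this sketch a hypothetical S-S matrix is combined with an averaged R-R matrix of the form in (18), and the S-S matrix is then recovered element by element; the function name and all values are assumptions made for illustration.

```python
import numpy as np

def stimulus_confusions(P_S_assigned, P_R):
    """Equation (20): remove the averaged response confusions element by element."""
    N = P_S_assigned.shape[0]
    return (P_S_assigned - P_R) / (1.0 - N * P_R)

# Hypothetical S-S matrix and mean response-confusion probability.
P_SS = np.array([[0.80, 0.15, 0.05],
                 [0.10, 0.80, 0.10],
                 [0.05, 0.15, 0.80]])
P_R = 0.05
Q_R = 1.0 - 2 * P_R                                    # from equation (17), N = 3
averaged_RR = (Q_R - P_R) * np.eye(3) + P_R * np.ones((3, 3))

observed = P_SS @ averaged_RR          # what the counterbalanced average would be
recovered = stimulus_confusions(observed, P_R)
```

The sketch also shows why small $P^R$ matters in practice: with $P^R = .05$ the observed entries differ from the S-S entries by at most a few hundredths.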

If the individual R-R transition probabilities are desired, (8) may be
averaged over all M assignments. Employing arguments analogous to those
developed before, the probability of an R-R transition is found to be

(21) $P^R_{ik} = \dfrac{P_{(i)k} - P^S}{1 - NP^S} .$

This is the inverse of (20) in that $P^S$ denotes the mean transition probability
taken over all pairs of stimuli.

The utility of reducing the S-R transitions to S-S and R-R transitions
can now be ascribed to the consequent increase in predictive power of the
model. If the S-S probability matrices have been determined (in H
experiments) for each of H sets of N stimuli and if the R-R matrices have been
determined (in H further experiments) for each of H sets of N responses,
the total number of experiments for which the S-R probability matrices
can be predicted is $N! \cdot H^2$. For, returning to (11), there are N! distinct
matrices which are substitutable for J (each corresponding to a different
S-R assignment with one set of stimuli and one set of responses) and there
are $H^2$ distinct pairs consisting of one set of stimuli and one set of responses.
The ratio of the number of experiments for which prediction can be made
to the number already carried out is $N! \cdot H/2$. In contrast to this, if the
S-R probabilities are regarded as irreducible, predictions could be made only
to replications of experiments already carried out, and the ratio just
considered could never exceed unity.
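
The counting argument can be checked with a few lines of Python. This is only a sketch of the arithmetic in the paragraph above; the function name is hypothetical.

```python
from math import factorial


def prediction_ratio(n, h):
    """Ratio of predictable experiments (N! * H^2) to the experiments
    already carried out (H for stimuli plus H for responses)."""
    predictable = factorial(n) * h ** 2
    performed = 2 * h
    return predictable / performed
```

For instance, `prediction_ratio(9, 1)` evaluates 9!/2, the case of a single stimulus set and a single response set with N = 9.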


The Characterization of the S-S and R-R Processes in Terms of Psychological
Interstimulus and Interresponse Distances


In the preceding section the S-R process was reduced to S-S and R-R
processes which, in turn, were characterized by matrices of S-S and R-R
transition probabilities. The purpose of the present section will be to reduce
these matrices, each with N(N - 1) independent probabilities, to sets of
fewer than N(N - 1) quantities. Such a reduction is suggested by the
possibility that some simple relation exists between $P^S_{ik}$ and $P^S_{ki}$ of the S-S
matrix, and between $P^R_{ik}$ and $P^R_{ki}$ of the R-R matrix. This, in turn, appears
plausible if the probability of confusing two stimuli (or responses) is some
function of the dissimilarity between them so that, say, $P^S_{ik}$ and $P^S_{ki}$ will
increase or decrease together as the dissimilarity between $S_i$ and $S_k$ is made,
respectively, smaller or larger.

Instead of formally introducing the notion of dissimilarity, it is preferable
to define the concept of distance, which has the advantage of a rigorous
mathematical interpretation. Explicitly, a set of distances, $D_{ik}$, defined for
all pairs of elements, $S_i$ and $S_k$, is any collection of numbers satisfying, for
every $S_i$, $S_j$, and $S_k$, the following requirements called metric axioms
([2], pp. 5-16; [16], pp. 118-119):

$$ D_{ik} = 0, \quad \text{if } i = k, \tag{22} $$

$$ D_{ik} = D_{ki}, \tag{23} $$

$$ D_{ik} + D_{kj} \geq D_{ij}. \tag{24} $$


When speaking of the distance between $S_i$ and $S_k$, the symbol $D^S_{ik}$
will be used. Similarly the distance between $R_i$ and $R_k$ will be distinguished
by the symbol $D^R_{ik}$. Any set of elements for which a distance function satisfying
the metric axioms has been defined is called a metric space. The space may
be called a physical or a psychological space depending upon whether the
distances are determined from physical or psychological data. (An example
of a physical space is the set of sinusoidal tones with

$$ D_{ik} = [(f_i - f_k)^2 + (a_i - a_k)^2]^{1/2}, $$

where $f_i$ is the frequency and $a_i$ the amplitude of tone $S_i$. That this definition
satisfies axioms (22) and (23) is immediately clear. The satisfaction of (24)
follows from the inequality of Schwarz. For a review of some psychological
measures which could presumably be used to construct a psychological
space for this same set of tones, see Messick [17].)
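
The three metric axioms are easy to test mechanically on any finite distance matrix. The sketch below does so for the tone example; the tone values, the tolerance, and the function names are hypothetical, and frequency and amplitude are treated as plain numbers with no attention to units.

```python
from itertools import product
from math import hypot


def tone_distance(tone_a, tone_b):
    """Euclidean distance between two (frequency, amplitude) pairs."""
    (f_a, a_a), (f_b, a_b) = tone_a, tone_b
    return hypot(f_a - f_b, a_a - a_b)


def satisfies_metric_axioms(d, tol=1e-9):
    """Check (22) zero self-distance, (23) symmetry, (24) triangle inequality."""
    n = len(d)
    if any(abs(d[i][i]) > tol for i in range(n)):
        return False
    for i, k in product(range(n), repeat=2):
        if abs(d[i][k] - d[k][i]) > tol:
            return False
    for i, j, k in product(range(n), repeat=3):
        if d[i][k] + d[k][j] < d[i][j] - tol:
            return False
    return True


# Hypothetical tones given as (frequency, amplitude).
tones = [(440.0, 1.0), (445.0, 2.0), (452.0, 1.5), (460.0, 3.0)]
distances = [[tone_distance(s, t) for t in tones] for s in tones]
```

A matrix such as `[[0, 1, 5], [1, 0, 1], [5, 1, 0]]` fails the check because 1 + 1 < 5 violates (24).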

It will be assumed that there exists some function, f, such that $P^S_{ik}$ is
proportional to $f(D^S_{ik})$. The factor of proportionality must depend upon i
for, although the average distance of $S_i$ from the other stimuli in the learning
situation may be large or small, the probability of transition from $S_i$ to $S_k$
summed over all k must be conserved by equation (2). Thus the relation
may be set down in the preliminary form

$$ P^S_{ik} = d_i \cdot f(D^S_{ik}), \tag{25} $$

where $d_i$ is a constant associated with $S_i$, and where $D^S_{ik}$ satisfies the metric
axioms. Summing over all k,

$$ \sum_k P^S_{ik} = 1 = d_i \cdot \sum_k f(D^S_{ik}). $$

Solving for $d_i$, it is immediately found that

$$ P^S_{ik} = \frac{f(D^S_{ik})}{\sum_h f(D^S_{ih})}. \tag{26} $$

At this point some decision must be reached concerning the nature
of the function f. This, of course, is one way of formulating the problem 
traditionally investigated in studies of stimulus generalization. The inde- 
pendent variable in such studies is some measure of stimulus dissimilarity, 
and the dependent variable is some measure of stimulus confusability (like 
the probability that the response, reinforced to one stimulus, will occur to 
the other). 

Now, although these studies lend support to the conjecture that f is a 
continuous monotonically decreasing function, attempts to specify it with 
greater precision have not led to any consistent picture ([14], pp. 616-617, 
[26], pp. 577-579). This may be a consequence, at least in part, of the variety 
of independent measures employed. The most frequent measures of dis- 
similarity which have been used are distance on a physical scale and number 
of just noticeable differences, JNDs, separating two stimuli. However, 
there are theoretical objections to either of these measures. 

That psychological distance or confusion probability is not an invariant 
function of physical distance is now well known. Some investigators, though, 
have supposed that the summation of JNDs provides the kind of measure 
required ([15], pp. 183-225). Unfortunately, in order to sum JNDs between 
two stimuli, this summation must be carried out along some path between 
these stimuli. But the resulting sum will be invariant and, therefore, possess 
fundamental significance only if this path is a least path, that is, yields a 
shortest distance (in psychological space) between the two stimuli. One 
cannot presume, in arbitrarily holding certain physical parameters constant 
(as is ordinarily done in the summation of JNDs), that the summation is 
constrained thereby to a shortest path (or geodesic) in psychological space, 
even though it is, of course, confined to a shortest path (or straight line) 
in physical space. Indeed, given any particular summation, there is no way 
of ascertaining whether it was or was not carried out over a least path. 

These considerations lead one to look for some way of estimating the 
psychological distance between two stimuli without depending either upon 
physical scales or upon any arbitrary path of integration. One possibility, 
suggested by Gulliksen and Wolfle [11], is to use a judgmental procedure in 
which subjects directly estimate the similarity of stimuli. Indeed, Plotkin
[22] and, later, Attneave [1] in their studies of stimulus generalization in
paired-associates learning have used techniques of this kind. Although they 
were able to demonstrate a positive correlation between confusion frequency 
and judged similarity of stimuli, the exact form of the relation between 
these variables was not pursued. 

More recently, a number of multidimensional scaling methods have been 
developed which make possible the determination of a set of interstimulus 
distances solely on the basis of similarity judgments [17]. Thus one might 
now extend the kind of approach proposed by Gulliksen and Wolfle to the 
quantitative study of generalization in paired-associates learning. However,
beyond the fact that these judgmental methods are limited in application to 
mature human subjects, there appears to be no readily available means for 
interpretation of the dependent variables of these scaling procedures within 
the framework of existing behavior theory. Furthermore these methods 
have not, as yet, been extended to the response domain. 

To maintain the integrity of the present approach, it seems desirable to 
avoid the use of methods which essentially fall outside the scope of the be- 
havior model to be constructed. Instead of starting with an arbitrary measure 
of psychological distance in order to discover the relation between this measure 
and consequent stimulus confusion probabilities (the traditional approach 
to the generalization problem), the present strategy will be to begin with the 
confusion probabilities themselves and then, proceeding in the reverse 
direction, to discover a function, f, which will transform these probabilities 
into measures satisfying the metric axioms. 

Actually, there are many functions having the necessary properties. 
This is because the so-called triangle axiom given in inequality (24) is not 
particularly stringent. However, that requirement can be usefully strength- 
ened by making the reasonable assumption that physical space can be mapped 
into psychological space by a transformation which is not only continuous 
but also has continuous first partial derivatives. Such a transformation carries 
any straight line in physical space into a smooth (differentiable) curve in 
psychological space. 

The importance of this assumption derives from the fact that a segment
of a differentiable curve approximates more and more closely a segment of a
straight line as the two segments are made shorter and shorter. In the limit,
for three stimuli, $S_i$, $S_j$, and $S_k$, such that $S_k$ is between $S_i$ and $S_j$ on a
single physical dimension, axiom (24) should go over into

$$ D_{ik} + D_{kj} = D_{ij}, \tag{27} $$

provided that $S_i$ and $S_j$ are sufficiently close together in physical space.

It is possible to demonstrate (by introducing further assumptions) that 
an exponential decay form can be deduced for the function f so that (27) 
will be satisfied for any three properly chosen stimuli. In order to main- 
tain the continuity of the present argument, however, it will be taken as 
primitive that f is an exponential decay function. The justification for this 
choice will have to depend upon the empirical results which follow from 
its use. It might be noted, however, that, apparently largely on the basis of 
Hovland’s results [13], Hull postulated an exponential decay function ([15],
pp. 183-225). Other generalization studies have also obtained data roughly
consistent with this assumption [3, 9, 10, 12, 19, 23]. In any case, the
exponential function is perhaps the simplest function with the desirable behavior
that, as its argument ranges from zero over all positive bounds, it subsides
asymptotically from a finite value towards zero.
Substituting an exponential decay function for f, (26) becomes

$$ P^S_{ik} = \frac{\exp(-D^S_{ik})}{\sum_h \exp(-D^S_{ih})}. \tag{28} $$





Since psychological distance is symmetric in accordance with (23) and 
since the psychological distance between any stimulus and itself must be 
zero by (22), the entire matrix of transition probabilities may now be recon- 
structed on the basis of just N(N — 1)/2 distances. Thus only half of the 
degrees of freedom in the probability matrix may really be free in the theo- 
retical sense. 
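
A minimal sketch of (28) in Python may make the reduction concrete: a symmetric matrix with N(N - 1)/2 free distances is turned into a full matrix of transition probabilities. The distance values and the function name are hypothetical.

```python
from math import exp


def transition_matrix(d):
    """P_ik = exp(-D_ik) / sum_h exp(-D_ih), in the form of (28)."""
    rows = []
    for d_row in d:
        strengths = [exp(-x) for x in d_row]
        total = sum(strengths)
        rows.append([s / total for s in strengths])
    return rows


# Hypothetical symmetric distances among three stimuli (three free values).
d = [[0.0, 1.0, 2.5],
     [1.0, 0.0, 1.5],
     [2.5, 1.5, 0.0]]
p = transition_matrix(d)
```

Each row of `p` sums to one, and nearer stimuli receive the larger confusion probabilities. Because the row normalizers differ, `p[i][k]` and `p[k][i]` need not be equal even though the distances are symmetric.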

The analogue of (28) is, for the response process,

$$ P^R_{ik} = \frac{\exp(-D^R_{ik})}{\sum_h \exp(-D^R_{ih})}. \tag{29} $$





The choice of the exponential function in (29) has little precedent
since the response generalization function has been investigated in only a 
limited number of studies [7, 20]. The particular choice is made on the basis 
of the same arguments assembled in support of the selection of that function 
in relation to the stimulus process. In addition, the symmetry of the model 
suggests that the same function may apply in both cases. 


The Introduction of the Stimulus and Response Weights 


After effecting the reductions of the last section, it must be acknowledged
that they were based on the questionable assumption that the psychological
distance from $S_i$ to $S_k$ is always identical to the psychological distance from
$S_k$ to $S_i$. It has, for instance, long been known that unfamiliar stimuli tend
to be mistaken for familiar stimuli. Thus, under brief exposure, downwark
may be seen as downright, whereas the reverse seldom occurs ([21], p. 360).
One might therefore suppose that the distance from downwark to downright
is considerably less than the distance in the reverse direction.

Now, although it is possible to construct a consistent distance geometry 
even if the symmetry requirement is dropped from the metric axioms ([4],
pp. 3-4), to do so here would be to relinquish that possibility which provided 
the impetus for introducing the distance notion in the first place, that is, 
the possibility of completely characterizing the S-S probability matrix in 
terms of a substantially reduced number of quantities. 

However, it may be that apparent violations of distance symmetry 
can always be traced to some factor, like familiarity, which pertains to in- 
dividual (rather than to pairs of) stimuli and which has the consequence 
that the presentation of one stimulus leads to the perception of another
more frequently than would be expected knowing the probability of perceptual
distortion in the opposite direction. With each $S_k$, then, there may be
associated a weight, $W^S_k$, such that, if $S_i$ is presented, the probability of
perceiving $S_k$ is proportional to $W^S_k$. Equation (28) will then assume the modified
form

$$ P^S_{ik} = \frac{W^S_k \exp(-D^S_{ik})}{\sum_h W^S_h \exp(-D^S_{ih})}. \tag{30} $$





The introduction of the term $W^S_k$ provides for both the redundancy in
the probability matrix and, presumably, any asymmetry among the stimulus
similarities. In particular it is assumed that the entire S-S transition
probability matrix can now be reconstructed on the basis of N(N - 1)/2
independent distances together with N weights, that is, N(N + 1)/2 quantities
in all. The ratio of the number of independent quantities reconstructed,
N(N - 1), to the number used in the reconstruction, therefore, can be shown
to be 2(N - 1)/(N + 1).

Postulating that the response process also involves, for every response
$R_k$, a weight or response preference $W^R_k$, (29) becomes

$$ P^R_{ik} = \frac{W^R_k \exp(-D^R_{ik})}{\sum_h W^R_h \exp(-D^R_{ih})}. \tag{31} $$
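
The effect of the weights can be seen in a small numerical sketch of the weighted form (the same code serves for stimuli or responses). The distances, the weights, and the function name are hypothetical; the weights are chosen to average one.

```python
from math import exp


def weighted_transition_matrix(d, w):
    """P_ik = W_k exp(-D_ik) / sum_h W_h exp(-D_ih)."""
    n = len(d)
    rows = []
    for i in range(n):
        strengths = [w[k] * exp(-d[i][k]) for k in range(n)]
        total = sum(strengths)
        rows.append([s / total for s in strengths])
    return rows


# Hypothetical symmetric distances; element 1 is "familiar" (large weight).
d = [[0.0, 1.0, 2.5],
     [1.0, 0.0, 1.5],
     [2.5, 1.5, 0.0]]
w = [0.5, 2.0, 0.5]
p = weighted_transition_matrix(d, w)
```

With these values `p[0][1]` is much larger than `p[1][0]`: the heavily weighted element attracts confusions from its neighbor far more often than the reverse, although the distance between the two is the same in both directions.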





In order to estimate the stimulus weights from the S-S probabilities,
(30) may be taken as a starting point. The probability that $S_i$ will be
correctly perceived is

$$ P^S_{ii} = \frac{W^S_i \exp(-D^S_{ii})}{\sum_h W^S_h \exp(-D^S_{ih})}. \tag{32} $$

Since the weights occur in both the numerator and denominator, they can
be viewed as containing some arbitrary multiplicative factor which may be
chosen, for convenience, so that

$$ \frac{1}{N} \sum_i W^S_i = 1. \tag{33} $$

Now from (22) it follows that $\exp(-D^S_{ii}) = 1$. Whence

$$ P^S_{ii} = \frac{W^S_i}{\sum_h W^S_h \exp(-D^S_{ih})}. \tag{34} $$

To eliminate the distance term one has simply to form the product

$$
\left(\frac{P^S_{ik}}{P^S_{ii}}\right)^{1/2}
\left(\frac{P^S_{ki}}{P^S_{kk}}\right)^{-1/2}
= \frac{W^S_k}{W^S_i}.
\tag{35}
$$

If this equation is summed over all i, the weights may be obtained except for
a factor, $\sum_i (1/W^S_i)$, which does not depend upon k and, so, is determined
by (33).

By substitution of (20) into the summed equation, the weights may be
expressed in terms of the overt S-R transition probabilities through the N
equations,

$$
W^S_k = \frac{N \sum_i
\left(\dfrac{P_{i(k)} - P^R}{P_{i(i)} - P^R}\right)^{1/2}
\left(\dfrac{P_{k(i)} - P^R}{P_{k(k)} - P^R}\right)^{-1/2}}
{\sum_h \sum_i
\left(\dfrac{P_{i(h)} - P^R}{P_{i(i)} - P^R}\right)^{1/2}
\left(\dfrac{P_{h(i)} - P^R}{P_{h(h)} - P^R}\right)^{-1/2}}.
\tag{36}
$$

Following a similar derivation, the response weights can be shown to
be given by the N equations

$$
W^R_k = \frac{N \sum_i
\left(\dfrac{P_{(i)k} - P^S}{P_{(i)i} - P^S}\right)^{1/2}
\left(\dfrac{P_{(k)i} - P^S}{P_{(k)k} - P^S}\right)^{-1/2}}
{\sum_h \sum_i
\left(\dfrac{P_{(i)h} - P^S}{P_{(i)i} - P^S}\right)^{1/2}
\left(\dfrac{P_{(h)i} - P^S}{P_{(h)h} - P^S}\right)^{-1/2}}.
\tag{37}
$$

The Resolution of Psychological Distance into Orthogonal Coordinates in
Psychological Space


In this section procedures will be set forth for estimating the distances
$D^S_{ik}$ and $D^R_{ik}$. In addition, a further reduction of the distances will be proposed
so that the transition probability matrices can be reconstructed on the basis
of still fewer quantities.

If, in (35), the same terms are used but both exponents are taken as
positive, the weights may be eliminated to yield

$$
\left(\frac{P^S_{ik}}{P^S_{ii}}\right)^{1/2}
\left(\frac{P^S_{ki}}{P^S_{kk}}\right)^{1/2}
= \exp(-D^S_{ik}).
\tag{38}
$$

In terms of the observed S-R transition probabilities, then, the distances are
given by the $N^2$ equations

$$
D^S_{ik} = -\log\left[
\left(\frac{P_{i(k)} - P^R}{P_{i(i)} - P^R}\right)^{1/2}
\left(\frac{P_{k(i)} - P^R}{P_{k(k)} - P^R}\right)^{1/2}
\right],
\tag{39}
$$

as may be seen by a substitution from (20).

Of the $N^2$ distances given by (39), N(N - 1)/2 have been supposed
to vary independently. However, suppose the N stimulus points can be
imbedded in a Euclidean space of K dimensions. In this case the distances
can be reconstructed on the basis of just NK Cartesian coordinates, $X^S_{\alpha i}$
($\alpha$ = 1, 2, ..., K) and the generalized Pythagorean theorem

$$
D^S_{ik} = \Big\{ \sum_\alpha (X^S_{\alpha i} - X^S_{\alpha k})^2 \Big\}^{1/2}.
\tag{40}
$$

Thus an economy of description is possible in cases with K < (N - 1)/2.
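
A brief sketch of the distance estimate in (38): probabilities are generated from known distances and weights, and the distances are then recovered from the probabilities alone. As before, the code works with S-S probabilities directly (that is, with $P^R$ taken as zero or already corrected for), and every name and number is hypothetical.

```python
from math import exp, log, sqrt


def model_probabilities(d, w):
    """Weighted exponential model: P_ik = W_k exp(-D_ik) / sum_h W_h exp(-D_ih)."""
    n = len(d)
    rows = []
    for i in range(n):
        strengths = [w[k] * exp(-d[i][k]) for k in range(n)]
        total = sum(strengths)
        rows.append([s / total for s in strengths])
    return rows


def estimate_distances(p):
    """D_ik = -log[(P_ik / P_ii)^(1/2) * (P_ki / P_kk)^(1/2)], from (38)."""
    n = len(p)
    return [[-log(sqrt(p[i][k] / p[i][i]) * sqrt(p[k][i] / p[k][k]))
             for k in range(n)] for i in range(n)]


true_d = [[0.0, 1.0, 2.5], [1.0, 0.0, 1.5], [2.5, 1.5, 0.0]]  # hypothetical
weights = [0.5, 2.0, 0.5]                                     # hypothetical
estimated_d = estimate_distances(model_probabilities(true_d, weights))
```

The weights cancel in the product, so `estimated_d` matches `true_d` whatever weights were used to generate the probabilities.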

Torgerson has presented a general method for determining a set of
orthogonal coordinates, given a set of distances [25]. This procedure, as
modified by Messick and Abelson [18], is as follows: Starting with the N × N
symmetric matrix, $D_{SS}$, of interstimulus distances, an N × N matrix,
$B_{SS}$, of scalar products of vectors from the centroid of the system of stimulus
points to all pairs of points, $S_i$ and $S_k$, is computed from

$$
B^S_{ik} = \frac{1}{2N} \sum_h (D^S_{hk})^2 + \frac{1}{2N} \sum_h (D^S_{ih})^2
- \frac{1}{2} (D^S_{ik})^2 - \frac{1}{2N^2} \sum_g \sum_h (D^S_{gh})^2.
\tag{41}
$$

Young and Householder [27] have shown that, if this matrix is positive
semidefinite (that is, if the points can be imbedded in a K-dimensional
Euclidean space), it may be factored so that

$$
B_{SS} = X_{1S} \cdot X'_{1S} + X_{2S} \cdot X'_{2S} + \cdots + X_{KS} \cdot X'_{KS},
\tag{42}
$$

where $X_{\alpha S}$ represents an N × 1 matrix (or column vector) giving the
coordinates $X^S_{\alpha i}$ for all stimuli $S_i$ on the one dimension $\alpha$, and where
$X'_{\alpha S}$ is the 1 × N transpose of that matrix.

Now there are infinitely many possible factor decompositions of the
form given in (42), each one corresponding to a different orientation of the
orthogonal coordinate system in K-space. Which one of these is selected,
however, can be a matter of arbitrary stipulation, since they all yield the
same matrix $B_{SS}$ and since the interstimulus distances, as given in (40),
are invariant under orthogonal transformations of the coordinate system.
In practice the individual dimensions (or factors) may be extracted in such
a way that $X_{1S}$ accounts for the largest possible variance among the original
distances, $X_{2S}$ for the largest possible variance in any direction orthogonal
to the first dimension, and so on. In this way factoring may be terminated
when the ability to reconstruct the original transition probability matrix is
no longer significantly augmented by extraction of further dimensions.
Various procedures for factoring $B_{SS}$ in this way are available ([24], pp.
149-175, 473-510).
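
The scalar-product step can be checked without carrying out any factoring. The sketch below computes the double-centered matrix from a set of distances and compares it with scalar products taken directly from centroid-referred coordinates. The four points are hypothetical, and the factoring itself (for example by an eigenvalue routine) is deliberately left out.

```python
from math import dist


def scalar_products(d):
    """Double-center the squared distances:
    B_ik = (row mean_i + column mean_k - D_ik^2 - grand mean) / 2."""
    n = len(d)
    sq = [[x * x for x in row] for row in d]
    row_mean = [sum(row) / n for row in sq]
    col_mean = [sum(sq[i][k] for i in range(n)) / n for k in range(n)]
    grand = sum(row_mean) / n
    return [[0.5 * (row_mean[i] + col_mean[k] - sq[i][k] - grand)
             for k in range(n)] for i in range(n)]


# Hypothetical points in a two-dimensional space.
points = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0), (3.0, 4.0)]
d = [[dist(a, b) for b in points] for a in points]
b = scalar_products(d)

# Direct computation from coordinates referred to the centroid.
centroid = tuple(sum(c) / len(points) for c in zip(*points))
centered = [tuple(x - m for x, m in zip(pt, centroid)) for pt in points]
b_direct = [[sum(u * v for u, v in zip(s, t)) for t in centered]
            for s in centered]
```

The two matrices `b` and `b_direct` agree, which is the property the factoring in (42) relies on.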

Exactly the same operations may be applied to the matrix of R-R
transition probabilities, $P_{RR}$, to obtain a symmetric matrix, $D_{RR}$, of
interresponse distances. The computation is implemented by the $N^2$ analogues of
(39), namely,

$$
D^R_{ik} = -\log\left[
\left(\frac{P_{(i)k} - P^S}{P_{(i)i} - P^S}\right)^{1/2}
\left(\frac{P_{(k)i} - P^S}{P_{(k)k} - P^S}\right)^{1/2}
\right].
\tag{43}
$$

Once again, if the N response points can be imbedded in an L-dimensional
Euclidean space, the distances may be reduced to NL orthogonal coordinates
$X^R_{\beta i}$ ($\beta$ = 1, 2, ..., L) such that

$$
D^R_{ik} = \Big\{ \sum_\beta (X^R_{\beta i} - X^R_{\beta k})^2 \Big\}^{1/2}.
\tag{44}
$$

The actual computation of the coordinates $X^R_{\beta i}$ will again be carried out by
factoring a scalar product matrix $B_{RR}$ so that

$$
B_{RR} = X_{1R} \cdot X'_{1R} + X_{2R} \cdot X'_{2R} + \cdots + X_{LR} \cdot X'_{LR},
\tag{45}
$$

where $X_{\beta R}$ is the column vector containing the coordinates for all responses
on dimension $\beta$.

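By substituting (40) and (44) into (30) and (31), (11) may now be stated
in the more explicit form

$$
P_{SR} =
\left[ \frac{W^S_k \exp -\{\sum_\alpha (X^S_{\alpha i} - X^S_{\alpha k})^2\}^{1/2}}
{\sum_h W^S_h \exp -\{\sum_\alpha (X^S_{\alpha i} - X^S_{\alpha h})^2\}^{1/2}} \right]
\cdot J \cdot
\left[ \frac{W^R_k \exp -\{\sum_\beta (X^R_{\beta i} - X^R_{\beta k})^2\}^{1/2}}
{\sum_h W^R_h \exp -\{\sum_\beta (X^R_{\beta i} - X^R_{\beta h})^2\}^{1/2}} \right],
\tag{46}
$$

where the bracketed expressions represent matrices generated by allowing
the indices i and k to run from 1 to N. Thus the complete set of $N^2$ S-R
transition probabilities can be predicted on the basis of 2N weights, (K + L)N
coordinates, and the permutation matrix, J, corresponding to the particular
S-R assignment enforced.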
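
The prediction step can be sketched end to end. The code below assumes that (11) has the product form "S-S matrix times permutation matrix times R-R matrix", and that the permutation matrix has a one in row i at the column of the response assigned to stimulus i. Coordinates, weights, the assignment, and all function names are hypothetical.

```python
from math import dist, exp


def choice_matrix(coords, weights):
    """Rows of W_k exp(-distance(i, k)), normalized to sum to one."""
    n = len(coords)
    rows = []
    for i in range(n):
        strengths = [weights[k] * exp(-dist(coords[i], coords[k]))
                     for k in range(n)]
        total = sum(strengths)
        rows.append([s / total for s in strengths])
    return rows


def permutation_matrix(assignment):
    """assignment[i] is the index of the response assigned to stimulus i."""
    n = len(assignment)
    return [[1.0 if assignment[i] == k else 0.0 for k in range(n)]
            for i in range(n)]


def matmul(a, b):
    return [[sum(a[i][h] * b[h][k] for h in range(len(b)))
             for k in range(len(b[0]))] for i in range(len(a))]


def predict_sr(stim_coords, stim_w, resp_coords, resp_w, assignment):
    """Assumed product form: P_SR = P_SS . J . P_RR."""
    p_ss = choice_matrix(stim_coords, stim_w)
    p_rr = choice_matrix(resp_coords, resp_w)
    return matmul(matmul(p_ss, permutation_matrix(assignment)), p_rr)


# Hypothetical case: N = 3, stimuli in K = 2 dimensions, responses in L = 1.
p_sr = predict_sr(
    stim_coords=[(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)],
    stim_w=[1.0, 1.2, 0.8],
    resp_coords=[(0.0,), (1.5,), (3.0,)],
    resp_w=[1.0, 1.0, 1.0],
    assignment=[2, 0, 1],
)
```

Here 2N + (K + L)N = 6 + 9 = 15 parameters and one assignment generate all nine S-R probabilities, and each row of `p_sr` sums to one.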


Problems of Estimation 


Certain practical difficulties arise in connection with the determination 
of the weights and coordinates owing to the fact that the probabilities, $P_{ik}$,
are never known exactly but only estimated from the experimental data. 
The purpose of this section is to propose some approximate procedures which 
can be used when limitations on the number of subjects or the number of 
trials make these necessary. 

First, with respect to the weights, the left-hand member of (35) will be
extremely unstable, in a statistical sense, when $P^S_{ik}$ and $P^S_{ki}$ are small. A
technique useful in alleviating this difficulty is the following: After the
stimulus weights have been estimated in a preliminary way by (36), each
row, i, of the matrix of quantities $W^S_k/W^S_i$ given in (35) may be multiplied
through by the tentative estimate for the corresponding $W^S_i$. In the resulting
matrix, each of the N entries in column k will be an estimate of $W^S_k$. The final
estimate may then be taken, for each column, as the median entry in that
column. Exactly the same technique may be used to refine the response
weight estimates.
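
The median refinement can be sketched as follows. The code forms the ratio matrix of (35), scales each row by a tentative weight, and takes column medians; the generating model, the values, and the function names are hypothetical.

```python
from math import exp, sqrt
from statistics import median


def model_probabilities(d, w):
    """Weighted exponential model used only to generate example data."""
    n = len(d)
    rows = []
    for i in range(n):
        strengths = [w[k] * exp(-d[i][k]) for k in range(n)]
        total = sum(strengths)
        rows.append([s / total for s in strengths])
    return rows


def refine_weights(p, tentative):
    """Scale row i of the W_k / W_i ratios by the tentative W_i, then take
    the median of each column as the refined estimate of W_k."""
    n = len(p)
    ratio = [[sqrt(p[i][k] / p[i][i]) / sqrt(p[k][i] / p[k][k])
              for k in range(n)] for i in range(n)]
    scaled = [[tentative[i] * ratio[i][k] for k in range(n)]
              for i in range(n)]
    return [median(scaled[i][k] for i in range(n)) for k in range(n)]


d = [[0.0, 1.0, 2.5], [1.0, 0.0, 1.5], [2.5, 1.5, 0.0]]  # hypothetical
true_w = [0.5, 2.0, 0.5]
p = model_probabilities(d, true_w)
# One tentative weight is deliberately off; the column medians ignore it.
refined = refine_weights(p, tentative=[0.5, 2.0, 0.6])
```

In this error-free example a single poor tentative weight affects only one entry per column, so the medians still return the generating weights.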

In the estimation of distances the difficulty stems from the use of the
logarithmic transformation of (39) and the consequent fact that, if $P_{i(k)}$
and $P_{k(i)}$ are small, a slight variation in these leads to a large variation in $D^S_{ik}$.
The admission of such extreme instability in the determination of large
interstimulus distances will have a disruptive influence on the factor solution
used in obtaining the stimulus coordinates.

One might be inclined to redefine the distance between two widely
separated stimuli, $S_i$ and $S_k$, as the sum of distances over some connected
path of smaller distances between them. Thus, in Fig. 2, it might be supposed,
as an approximation, that

$$ D^S_{ik} = D^S_{ig} + D^S_{gh} + D^S_{hj} + D^S_{jk}. $$

FIGURE 2

Different ways of approximating a large interstimulus distance by a sequence of smaller
interstimulus distances. The circles represent stimuli in a two-dimensional psychological
space.


The problem here is one of choosing between alternative paths such as 
I and II. However, such a selection should be guided by two general rules. 
First, a relatively direct path should be chosen. That is, the sum of distances 
over the path, $\sum D$, should be small. Second, a path should be chosen which
does not contain any relatively large (and therefore unreliable) distances. 
That is, the largest distance in the path, max(D), should be small. Path I, 
then, is objectionable on both of these grounds. 

Combining these rules, every distance in the matrix $D_{SS}$ may be
redefined as the sum, $\sum D$, of distances over that connected path for which the
product

$$ \max(D) \cdot \sum D \ \text{ is minimum.} \tag{47} $$

The small distances will generally remain unchanged during the re-estimation
procedure. The large distances, however, will be subject to substantially less
unsystematic error. The systematic error will presumably be somewhat
augmented, however. The technique given here is not the statistically exact
one, which would have to take into account the number of events upon which
each probability estimate is based.
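
One simple way to apply criterion (47) for small N is to enumerate every path without repeated stimuli and keep the one with the smallest product. The sketch below does this by brute force, which is an assumption of convenience rather than a method given in the text; the distance values and names are hypothetical.

```python
def best_path(d, start, goal):
    """Return (sum of distances, path) for the path from start to goal that
    minimizes max(D) * sum(D) over all paths without repeated stimuli."""
    n = len(d)
    best = None  # (score, total, path)

    def extend(path, total, largest):
        nonlocal best
        node = path[-1]
        if node == goal:
            score = largest * total
            if best is None or score < best[0]:
                best = (score, total, path)
            return
        for nxt in range(n):
            if nxt not in path:
                step = d[node][nxt]
                extend(path + [nxt], total + step, max(largest, step))

    extend([start], 0.0, 0.0)
    return best[1], best[2]


# Hypothetical stimuli roughly on a line; the largest direct distance
# (between 0 and 3) is assumed to be inflated by sampling error.
d = [[0.0, 1.0, 2.0, 3.6],
     [1.0, 0.0, 1.0, 2.0],
     [2.0, 1.0, 0.0, 1.0],
     [3.6, 2.0, 1.0, 0.0]]
long_estimate = best_path(d, 0, 3)
short_estimate = best_path(d, 0, 1)
```

The large distance is replaced by the sum over the chain 0, 1, 2, 3, while the small distance between 0 and 1 keeps its direct value, in line with the remark that small distances generally remain unchanged. The enumeration grows factorially, so it is practical only for small sets such as N = 9.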

Finally the application of the model is impeded insofar as it holds only 
for a given stage of learning, that is, over spans of trials for which the $P_{i(k)}$
remain relatively constant. The weights and distances could be estimated 
with greater reliability if they were based upon probabilities averaged over 
the entire learning session. 

Now it is an implication of the model that the ratio of any two distances
must be invariant over learning. Therefore some function, g, of time, t,
exists such that

$$ D^S_{ik(t)} = g(t) \cdot D^S_{ik}, \tag{48} $$

where $D^S_{ik(t)}$ is the psychological distance between $S_i$ and $S_k$ at some given
time, t, as defined by (39), and where $D^S_{ik}$ is some fixed distance between these
stimuli which does not depend on t.

The fixed distances, $D^S_{ik}$, contain an arbitrary multiplicative constant
which may be so adjusted that

$$ \frac{1}{T} \int_0^T g(t)\,dt = 1. \tag{49} $$

It then follows that

$$ D^S_{ik} = \frac{1}{T} \int_0^T D^S_{ik(t)}\,dt. \tag{50} $$

However, not only is this average computationally impractical, but it also
heavily weights the large estimates for the $D^S_{ik(t)}$, which are based on small
numbers of transitions and, so, are extremely unreliable.

Since the present model pertains to generalization at any given time
rather than to the course of learning over time, no stipulation has been made
regarding the function g(t). However, the usual learning results indicate that
the $P_{ik}$ (for i ≠ k) decline rapidly at first and then more slowly, from initial
values of $W^S_k/N$ towards their lower asymptotic bounds. This suggests that
the average given in (50) may be more reliably approximated by the average

$$
D^S_{ik} \approx -\log\left\{ \frac{1}{T} \int_0^T \exp[-D^S_{ik(t)}]\,dt - C^S \right\}
\tag{51}
$$

so long as the interstimulus distances do not cover too wide a range. This
average, although it does discount the large, unstable distance estimates,
requires the inclusion of a constant, $C^S$, in order to subtract out the essentially
random transitions which account for the $P_{ik}$ values of $W^S_k/N$ before the
subject has acquired any knowledge of the prevailing S-R assignment.
By transposing terms in (51), then, it may be seen that, for purposes of
estimation, (30) is to be replaced by

$$
P^S_{ik} = \frac{W^S_k [\exp(-D^S_{ik}) + C^S]}{\sum_h W^S_h [\exp(-D^S_{ih}) + C^S]},
\tag{52}
$$

where $P^S_{ik}$ applies to the entire learning session. Following through the
derivation for (39), the psychological distances are found to be approximately
given by

$$
D^S_{ik} \approx -\log\left\{ (1 + C^S)
\left( \frac{P^S_{ik} \cdot P^S_{ki}}{P^S_{ii} \cdot P^S_{kk}} \right)^{1/2} - C^S \right\}.
\tag{53}
$$

Likewise, it is assumed that there exists a constant, $C^R$, such that

$$
D^R_{ik} \approx -\log\left\{ (1 + C^R)
\left( \frac{P^R_{ik} \cdot P^R_{ki}}{P^R_{ii} \cdot P^R_{kk}} \right)^{1/2} - C^R \right\}.
\tag{54}
$$


In order to use (53), it is necessary to obtain an estimate for $C^S$. This
may be done by calculating a set of stimulus coordinates under the assumption
that $C^S = 0$. If, then, the quantities

$$
\left( \frac{P_{i(k)} \cdot P_{k(i)}}{P_{i(i)} \cdot P_{k(k)}} \right)^{1/2},
\qquad (i, k = 1, 2, \ldots, N)
$$

are plotted as a function of the distances reconstructed from (40), an
asymptote, c, for large distances may be estimated by drawing a smooth curve
through the data-points. If the responses have been selected so that $P^R = 0$,
then $C^S = c/(1 - c)$. Exactly the same method can be used to estimate
$C^R$ if $P^S = 0$.
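
The relation between the asymptote c and the constant, and the corrected distance estimate of (53), can be sketched together. The code generates session-level probabilities from the form (52), converts an asymptote into the constant, and recovers the distances; the fitting of a smooth curve to find c is not attempted here, and all names and values are hypothetical.

```python
from math import exp, log, sqrt


def session_probabilities(d, w, c):
    """Form (52): P_ik = W_k [exp(-D_ik) + C] / sum_h W_h [exp(-D_ih) + C]."""
    n = len(d)
    rows = []
    for i in range(n):
        strengths = [w[k] * (exp(-d[i][k]) + c) for k in range(n)]
        total = sum(strengths)
        rows.append([s / total for s in strengths])
    return rows


def constant_from_asymptote(asymptote):
    """C = c / (1 - c), where c is the large-distance asymptote."""
    return asymptote / (1.0 - asymptote)


def estimate_distances(p, c):
    """Form (53): D_ik = -log{(1 + C) [P_ik P_ki / (P_ii P_kk)]^(1/2) - C}."""
    n = len(p)
    return [[-log((1.0 + c)
                  * sqrt(p[i][k] * p[k][i] / (p[i][i] * p[k][k])) - c)
             for k in range(n)] for i in range(n)]


true_d = [[0.0, 1.0, 2.5], [1.0, 0.0, 1.5], [2.5, 1.5, 0.0]]  # hypothetical
weights = [0.5, 2.0, 0.5]
true_c = 0.05
p = session_probabilities(true_d, weights, true_c)
# In this model the plotted quantity tends to C / (1 + C) at large distances.
c_estimate = constant_from_asymptote(true_c / (1.0 + true_c))
estimated_d = estimate_distances(p, c_estimate)
```

With the exact asymptote the generating distances are recovered; an asymptote read from a hand-drawn curve would give only approximate values.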

The practical advantage of the counterbalancing technique mentioned
in connection with (20) and (21) results from the following finding. In the
application of (53) and (54), if $P^R$ and $P^S$ are small, they may be assumed
equal to zero. The only appreciable consequence of this procedure appears
to be a slight inflation of the estimates, respectively, for $C^S$ and $C^R$.

With regard to the estimation of the stimulus and response weights,
the constants $C^S$ and $C^R$ drop out in the derivation of (36) and (37). Therefore
the weights can be approximately estimated from these equations, as they
stand, even though the S-R transition probabilities are averaged over the
entire learning session.


Appendix


Since, in all the experimental work to be reported, N = 9, it will be useful
to exhibit permutation matrices, $J_m$, having the property that, over all
subjects, every pair of responses is assigned to each pair of stimuli the same
number of times. In order to do this, it is convenient to introduce $O_3$, the
3 × 3 null matrix; $I_3$, the 3 × 3 identity matrix; $H_3$, given by


. = 
is = 


and $K^{ij}_3$, the 3 × 3 matrix with elements $K_{gh} = 1$, if g = i and h = j, and
$K_{gh} = 0$, otherwise. If for positive integers r,


3 O; O; 


Jise-11 = Jisri = . H; 03], 
O, O; H; 
[O, I, 0, 
Jior-s1 = Jtor-21 = it O; I; |-Jis--n ; 
3s O; O;| 
 e ; = 
Jiorss1) =| O; HH; O,|-| Ky? Ki? K3?|-Jior-si » 
O, O, Hj] LK;’ K;’ K;° 


then the permutation matrix for any subject m (starting with some arbitrary
initial assignment, $J_{[1]}$) is given by

$$ J_m = J_{[m]} \cdot J_{[m-1]} \cdots J_{[2]} \cdot J_{[1]}. $$


The first 36 matrices formed in this way assign every pair of responses to 
each pair of stimuli just once. The second 36 assign the same pairs of responses 
to the same pairs of stimuli with the orders reversed from those of the first 
36 assignments. Thereafter the same assignments are repeated with every 
succeeding 72 matrices. Thus, if N = 9, M can be any multiple of 72. Indeed,
since $P^R_{ik}$ will generally not be far from $P^R_{ki}$, a satisfactory degree of
counterbalancing can probably be obtained with only the first 36 assignments.

For continuous distributions associated with dichotomous item scores,
the proportion of common-factor variance in the test, $H^2$, may be expressed
as a function of intercorrelations among items. $H^2$ is somewhat larger than the
coefficient $\alpha$ except when the items have only one common factor and its
loadings are restricted in value. The dichotomous item scores themselves are
shown not to have a factor structure, precluding direct interpretation of the
Kuder-Richardson coefficient, $r_{K-R}$, in terms of factorial properties. The
value of $r_{K-R}$ is equal to that of a coefficient of equivalence, $H^2_\phi$, when the
mean item variance associated with common factors equals the mean
interitem covariance. An empirical study with synthetic test data from populations
of varying factorial structure showed that the four parameters mentioned may
be adequately estimated from dichotomous data.


Factor structure and test reliability, $r_{tt}$, are closely connected in testing
theory. In Cronbach’s [2] joint treatment of these two topics he distinguishes
four kinds of reliability coefficients: (1) stability, (2) stability-and-equivalence,
(3) equivalence, and (4) hypothetical-self-correlation. Each, he says, is
uniquely characterized by the particular factor score variances assigned to
error variance, $\sigma^2_e$. Thus the coefficient of equivalence is defined by a general
formula, $r_{tt} = 1 - (\sigma^2_e/\sigma^2_t)$, where $\sigma^2_t$ is the variance of total test scores and
$\sigma^2_e$ includes both the variance of specific factors for each item and the residual
error variance. Stated differently, the coefficient of equivalence tells “the
degree to which the test score indicates the status of the individual at the
present instant in the general and group factors defined by the test” [2].
A definition of another coefficient, such as that of stability, would employ
the same general formula with a new specification of error variance.

In a later analysis of the Kuder-Richardson ([7], formula 20) coefficient
of equivalence, $r_{K-R}$, Cronbach [3] has stated in greater detail how total
test variance depends upon the factor loadings of the individual items. He
repeats his earlier argument that a general coefficient, $\alpha$, of which $r_{K-R}$ is a
special case, is the proportion of test variance due to common factors when
a special assumption holds: the mean common-factor variance within items
must equal the mean interitem covariance. This conclusion is weakened by
the recognition that the assumption does not hold for an interitem correlation
matrix with rank greater than one. Thus $r_{K-R}$ is the proportion of
common-factor variance only when there is but one common factor. Cronbach
presumes that $\alpha$ will closely estimate this proportion even with multifactored
cases unless the test contains distinct clusters.

Training Research Center, Randolph Air Force Base, Randolph Field, Texas. Permission
University Graduate School. The computational assistance of Mr. Norman Miller is

The present paper argues that a strict interpretation of the factorial 
hypothesis requires a reanalysis of Cronbach’s notions. This reanalysis 
requires a separate treatment of factorial structure and of reliability when- 
ever dichotomously scored items comprise the test under analysis. In brief, 
a factor structure of such items and the related factor structure of the total 
test exist for continuous distributions underlying each dichotomous distri- 
bution. These structures do not exist for the dichotomous distributions. 
On the other hand, since the scoring of items and of the total test employs 
the dichotomized scores, reliability measures must be properties of the 
dichotomous distribution. 

A consequence of this reasoning will be the specification of two distinct
$\alpha$ coefficients: $\alpha$ for the continuous case, and $\alpha_\Phi$ or $r_{K-R}$ for the dichotomous
case. The conditions under which $\alpha$ will equal the proportion of common-factor
variance in the total test variance, $H^2$, will be examined, extending
Cronbach’s statement on the single-factoredness requirement. The coefficient
$\alpha_\Phi$ or $r_{K-R}$ will be shown to approximate a coefficient of equivalence, $H_\Phi^2$,
for a test with dichotomously scored items. Thus, contrary to Cronbach’s
belief, $r_{K-R}$ will never estimate common-factoredness, even for single-factored
tests, and will estimate a coefficient of equivalence only under special conditions
even more restrictive than single-factoredness.


Theory of the Continuous Case 


Thurstone’s multiple-factor theory has as its basis the hypothesis ([8],
pp. 69-74):

$$s_{ip} = \sum_{m=1}^{r} a_{im}x_{mp} + b_i y_{ip} + e_i \eta_{ip}, \tag{1}$$

where
$s_{ip}$ = standard score for person $p$ on item $i$ of a test ($i = 1, 2, \ldots, n$),
$a_{im}$ = loading of common factor $m$ on item $i$ ($m = 1, 2, \ldots, r$),
$x_{mp}$ = standard score for person $p$ on common factor $m$,
$b_i$ = loading of item $i$ on a factor specific to item $i$,
$e_i = \sqrt{1 - \sum_{m=1}^{r} a_{im}^2 - b_i^2}$ = loading of error on item $i$,
and $y_{ip}$ and $\eta_{ip}$ have definitions analogous to that for $x_{mp}$. The $x_{mp}$, $y_{ip}$,
and $\eta_{ip}$ are independent of each other. The values of $a_{im}$, $b_i$, and $e_i$ are
parameters of the test.

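The item raw scores, $X_{ip} = \sigma_i s_{ip} + \mu_i$, where $\mu_i$ and $\sigma_i$ are the population
mean and standard deviation of the $X_{ip}$, may be summed to give total test
scores

$$T_p = \sum_{i=1}^{n} X_{ip} = \sum_{m=1}^{r}\Big(\sum_{i=1}^{n}\sigma_i a_{im}\Big)x_{mp} + \sum_{i=1}^{n}\sigma_i b_i y_{ip} + \sum_{i=1}^{n}\sigma_i e_i \eta_{ip} + \sum_{i=1}^{n}\mu_i. \tag{2}$$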

On a second administration of the test, the standard scores and total
test scores will be given by

$$s'_{ip} = \sum_{m=1}^{r} a_{im}x_{mp} + b_i y_{ip} + e_i \eta'_{ip}, \tag{3}$$

and

$$T'_p = \sum_{m=1}^{r}\Big(\sum_{i=1}^{n}\sigma_i a_{im}\Big)x_{mp} + \sum_{i=1}^{n}\sigma_i b_i y_{ip} + \sum_{i=1}^{n}\sigma_i e_i \eta'_{ip} + \sum_{i=1}^{n}\mu_i. \tag{4}$$

Equations (3) and (4) differ from (1) and (2) only in that $\eta_{ip}$ is a random
variable changing in value to $\eta'_{ip}$ on the second administration of the test.
The $\eta'_{ip}$ are independent of all previous variables.

Following Wilks ([10], pp. 33-35) and remembering that the variances
of the $x_{mp}$, $y_{ip}$, $\eta_{ip}$, and $\eta'_{ip}$ are all unity lead to

$$\operatorname{var}(T) = \operatorname{var}(T') = \sum_{m=1}^{r}\Big(\sum_{i=1}^{n}\sigma_i a_{im}\Big)^2 + \sum_{i=1}^{n}\sigma_i^2 b_i^2 + \sum_{i=1}^{n}\sigma_i^2 e_i^2. \tag{5}$$

Equation (5) is the same in substance as Cronbach’s [2] equations (2) and (3).
Similarly the covariance of total test scores on successive administrations is
given by

$$\operatorname{cov}(T, T') = \sum_{m=1}^{r}\Big(\sum_{i=1}^{n}\sigma_i a_{im}\Big)^2 + \sum_{i=1}^{n}\sigma_i^2 b_i^2. \tag{6}$$

The ratio of cov $(T, T')$ to var $(T)$ is the hypothetical self-correlation
$r_{TT'}$ of the continuous scores $T_p$ and $T'_p$. It is also seen from (5) and (6) to
be the ratio of variance contributed by common and specific factors to total
test variance.
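The decomposition in (5) and (6) can be sketched numerically. The following is a minimal illustration, not part of the original derivation; the function names and the three-item loadings are hypothetical values chosen only to show the arithmetic.

```python
def variance_parts(a, b, sigma):
    """Split var(T) into common, specific, and error parts, following (5).

    a[i][m]  -- loading of item i on common factor m
    b[i]     -- loading of item i on its specific factor
    sigma[i] -- raw-score standard deviation of item i
    """
    n, r = len(a), len(a[0])
    common = sum(sum(sigma[i] * a[i][m] for i in range(n)) ** 2 for m in range(r))
    specific = sum((sigma[i] * b[i]) ** 2 for i in range(n))
    # e_i^2 = 1 - sum_m a_im^2 - b_i^2, as in the definitions under (1)
    error = sum(sigma[i] ** 2 * (1 - sum(v * v for v in a[i]) - b[i] ** 2)
                for i in range(n))
    return common, specific, error


def self_correlation(a, b, sigma):
    """cov(T, T') / var(T): only the error part changes on retest, as in (6)."""
    common, specific, error = variance_parts(a, b, sigma)
    return (common + specific) / (common + specific + error)


# Hypothetical three-item test with one common factor (illustrative values only).
a = [[0.6], [0.6], [0.6]]
b = [0.5, 0.5, 0.5]
sigma = [1.0, 1.0, 1.0]
r_tt = self_correlation(a, b, sigma)
```

For these assumed values the common part is 3.24, the specific part .75, and the error part 1.17, so the self-correlation is 3.99/5.16.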

The relative contribution of each separate common or specific factor to
var $(T)$ is

$$A_m^2 = \Big(\sum_{i=1}^{n}\sigma_i a_{im}\Big)^2 \Big/ \operatorname{var}(T); \tag{7}$$

$$B_i^2 = \sigma_i^2 b_i^2 / \operatorname{var}(T). \tag{8}$$

$A_m^2$ and $B_i^2$ may also be called the squared factor loadings of common factor
$m$ for the total test and of specific factor $i$ for that test.

The relative contribution of common factors to var $(T)$ has previously
been termed $H^2$:

$$H^2 = \sum_{m=1}^{r} A_m^2 = \sum_{m=1}^{r}\Big(\sum_{i=1}^{n}\sigma_i a_{im}\Big)^2 \Big/ \operatorname{var}(T). \tag{9}$$

$H^2$ may also be considered a coefficient of equivalence for the continuous
case since $1 - H^2$ includes both specific variance and residual variance.
It may be noted that with a centroid solution having only
positive first factor loadings, Thurstone has shown $\sum_{i=1}^{n} a_{im} = 0$ for $m \neq 1$.
This implies that $A_m = 0$ for $m \neq 1$ and, consequently, that $H^2 = A_1^2$ for the centroid solution.
Equation (9) must be revised to define $H^2$ in terms of parameters more
readily determined than the $a_{im}$’s. By definition, $r_{ij}$, the correlation coefficient
between the item $i$ and item $j$ in a single test administration, is the expected
value of $(s_{ip}s_{jp})$. This definition coupled with (1) yields

$$r_{ij} = \sum_{m=1}^{r} a_{im}a_{jm}, \quad (i \neq j); \tag{10}$$

$$r_{ii} = \sum_{m=1}^{r} a_{im}a_{im} + b_i^2 + e_i^2 = 1, \quad (i = j). \tag{11}$$


The expressions (10) and (11) in turn imply that (5) may be rewritten

$$\operatorname{var}(T) = \sum_{i=1}^{n}\sum_{j=1}^{n}\sigma_i\sigma_j r_{ij}. \tag{12}$$

Equation (12) will be used to restate the denominator of (9). Reanalyzing
the numerator of that equation, define $r^*_{ij}$ as the correlation coefficient
between the items $i$ and $j$ on successive administrations of the test, excluding
the contribution of specific factors. In this case

$$r^*_{ij} = r_{ij} = \sum_{m=1}^{r} a_{im}a_{jm}, \quad (i \neq j); \tag{13}$$

and

$$r^*_{ii} = h_i^2 = \sum_{m=1}^{r} a_{im}^2, \quad (i = j), \tag{14}$$

where $h_i^2$ is the proportion of common-factor variance to total variance for
item $i$. Equations (13) and (14) lead to the conclusion

$$\sum_{m=1}^{r}\Big(\sum_{i=1}^{n}\sigma_i a_{im}\Big)^2 = \sum_{i=1}^{n}\sum_{j=1}^{n}\sigma_i\sigma_j r^*_{ij}. \tag{15}$$

Equations (12) and (15) may now be employed to rewrite (9) as

$$H^2 = \sum_{i=1}^{n}\sum_{j=1}^{n}\sigma_i\sigma_j r^*_{ij} \Big/ \sum_{i=1}^{n}\sum_{j=1}^{n}\sigma_i\sigma_j r_{ij}. \tag{16}$$

This equation is directly applicable to situations in which the $\sigma_i$, $\sigma_j$, and
$r_{ij}$ are known but the factor loadings themselves have not been determined.
The coefficient $\alpha$ for the continuous case may now be compared with $H^2$.
Cronbach’s equation (24) ([3], p. 305) in our notation is

$$\alpha = \frac{n}{n-1}\sum_{i=1}^{n}\sum_{\substack{j=1 \\ j \neq i}}^{n}\sigma_i\sigma_j r_{ij} \Big/ \sum_{i=1}^{n}\sum_{j=1}^{n}\sigma_i\sigma_j r_{ij}. \tag{17}$$

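For $\alpha$ to equal $H^2$, the proportion of common-factor variance to total test
variance, the numerators of (16) and (17) must be equal. This requirement
reduces by means of (13) and (14) to

$$\sum_{i=1}^{n}\sigma_i^2 h_i^2 + \sum_{i=1}^{n}\sum_{j=1}^{n}\sigma_i\sigma_j r^*_{ij} = \frac{n}{n-1}\sum_{i=1}^{n}\sum_{j=1}^{n}\sigma_i\sigma_j r^*_{ij}, \quad (i \neq j). \tag{18}$$

Simple algebraic manipulations lead to an equivalent condition:

$$\sum_{i=1}^{n}\sigma_i^2 h_i^2 - \Big(\sum_{i=1}^{n}\sum_{j=1}^{n}\sigma_i\sigma_j r^*_{ij}\Big)\Big/ n = 0. \tag{19}$$

It should be noted that the case $i = j$ is not excluded in equation (19) as it
was in (18).
Finally by (14) and (15), (19) leads to

$$\sum_{m=1}^{r}\Big[\sum_{i=1}^{n}\sigma_i^2 a_{im}^2 - \Big(\sum_{i=1}^{n}\sigma_i a_{im}\Big)^2 \Big/ n\Big] = 0. \tag{20}$$
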
But this is equivalent to the assertion that the standard deviation of the
$\sigma_i a_{im}$ must be zero for every $m$, or equivalent to the assertion that $a_{im} =
k_m/\sigma_i$, where $k_m$ is constant over $i$ but may vary over $m$. Combining this
result with (13) and (14) gives a general term for the reduced correlation
matrix:

$$r^*_{ij} = \sum_{m=1}^{r} k_m^2 \Big/ \sigma_i\sigma_j.$$

That matrix would then have rank 1, and a single common-factor solution
with loadings equal to

$$\sqrt{\sum_{m=1}^{r} k_m^2} \Big/ \sigma_i.$$

In summary of this point, Cronbach’s requirement of a single common
factor for equality of $H^2$ and $\alpha$ may be expanded to require that each item
have a single common-factor loading inversely proportional to its $\sigma_i$. Although
this condition is both necessary and sufficient, $H^2$ and $\alpha$ may be almost
equal without satisfaction of the condition. For estimation procedures
employing dichotomous data, the restrictive assumption $\sigma_i = \sigma$ for all $i$
will become necessary. In that case $\sigma$ will replace $\sigma_i$ and $\sigma_j$ where they occur
in (2) through (20). In particular $H^2$ will be defined in a manner almost
equivalent to Jackson and Ferguson’s equation for $r_{TT'}$ ([6], eq. 67)

$$H^2 = \sum_{i=1}^{n}\sum_{j=1}^{n} r^*_{ij} \Big/ \sum_{i=1}^{n}\sum_{j=1}^{n} r_{ij} \tag{21}$$

in place of that given in (16), and $\alpha$ will be given by

$$\alpha = \frac{n}{n-1}\sum_{i=1}^{n}\sum_{\substack{j=1 \\ j \neq i}}^{n} r_{ij} \Big/ \sum_{i=1}^{n}\sum_{j=1}^{n} r_{ij}. \tag{22}$$

Similarly the condition (20) required for equivalence of $H^2$ and $\alpha$ will reduce
to a requirement of a single common factor with a constant loading equal to

$$\sqrt{\sum_{m=1}^{r}(k_m^2/\sigma^2)} = \sqrt{\sum_{m=1}^{r} k_m^2} \Big/ \sigma.$$
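Equations (21) and (22) can be checked directly from a loading matrix. The sketch below assumes equal item standard deviations, as in (21) and (22); the function name is hypothetical, and the loadings are those listed for Populations I and II in Table 1 of this paper.

```python
def h2_and_alpha(a):
    """Return (H^2, alpha) from equations (21) and (22) for loadings a[i][m]."""
    n, r = len(a), len(a[0])
    # Reduced correlations r*_ij: communalities h_i^2 sit on the diagonal.
    r_star = [[sum(a[i][m] * a[j][m] for m in range(r)) for j in range(n)]
              for i in range(n)]
    sum_star = sum(sum(row) for row in r_star)
    off_diagonal = sum_star - sum(r_star[i][i] for i in range(n))
    sum_r = off_diagonal + n  # full correlation matrix has unit diagonal
    h2 = sum_star / sum_r
    alpha = n / (n - 1) * off_diagonal / sum_r
    return h2, alpha


pop_I = [[0.7746]] * 6                                # one common factor
pop_II = [[0.8660, 0.0]] * 3 + [[0.0, 0.8660]] * 3    # two distinct clusters

h2_I, alpha_I = h2_and_alpha(pop_I)
h2_II, alpha_II = h2_and_alpha(pop_II)
```

With equal loadings on one factor the two coefficients coincide near .900, while the two-cluster population gives $H^2$ near .900 but $\alpha$ near .720, which illustrates how far $\alpha$ can fall below $H^2$ when the test contains distinct clusters.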


Theory of the Dichotomous Case 


Suppose that a population proportion $P_i$ of persons pass item $i$ of the
$n$-item test previously discussed. Can we describe the standard scores corresponding
to 0 (fail) and 1 (pass) scores by an expression having the form
of (1)? All persons failing the item will have $s_{ip} = -P_i/\sqrt{P_i(1 - P_i)}$;
all persons passing the item will have $s_{ip} = (1 - P_i)/\sqrt{P_i(1 - P_i)}$. Now
any nontrivial application of (1) implies the existence of at least one nonzero
$a_{im}$, two distinct values of $x_{mp}$, a nonzero $e_i$, and two distinct values of $\eta_{ip}$.
Consequently (1) requires that there be at least three values of $s_{ip}$ in the
simplest case, occurring when $a_{im} = \sqrt{2}/2 = e_i$, $x_{mp} = \pm 1$, and $\eta_{ip} = \pm 1$.
Since the $x_{mp}$ and $\eta_{ip}$ are independent, $s_{ip}$ will take on one of three values, for
various persons, $\sqrt{2}$, 0, and $-\sqrt{2}$ rather than the $s_{ip}$ values given above.
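The simplest case just described can be enumerated. This short sketch (with hypothetical names) lists the values that (1) generates when $a_{im} = e_i = \sqrt{2}/2$ and both $x_{mp}$ and $\eta_{ip}$ take the values $\pm 1$, and compares them with the two standard scores of a pass-fail item.

```python
import math
from itertools import product


def dichotomous_standard_scores(p_pass):
    """Standard scores of the fail (0) and pass (1) responses for pass rate P_i."""
    sd = math.sqrt(p_pass * (1 - p_pass))
    return -p_pass / sd, (1 - p_pass) / sd


a = e = math.sqrt(2) / 2
# Every combination of x = +/-1 and eta = +/-1 in s = a*x + e*eta.
generated = sorted({round(a * x + e * eta, 6)
                    for x, eta in product((-1, 1), repeat=2)})
fail, passed = dichotomous_standard_scores(0.5)
```

The set `generated` holds three distinct values, whereas a dichotomous item has only the two values `fail` and `passed`, which is the point of the argument.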

But this simplest case completely refutes the proposition that dichoto- 
mous item scores have a factorial structure of their own. The factorial hy- 
pothesis of (1) will always imply the existence of three or more different 
score values on each item. One should never, then, suppose that dichotomous 
distributions may be generated by (1). Correspondingly, one should never 
factor analyze a product moment correlation matrix based on 0 and 1 scores 
(i.e., a matrix of phi coefficients). Although this matrix may be expressed as 
a product of a “factor” matrix by its transpose, that “factor” matrix will 
have no direct application to anything in factor theory. The misbehavior 
of factor loadings based on phi coefficients has been previously observed 
[5, 9]. The theoretical basis for this misbehavior has not been treated in the 
present manner.

In denying the factorial interpretability of dichotomized scores as
basic data, we do not completely foreswear an interest in the dichotomous
case. As a later section will show, dichotomous data can be used in such a
way as to estimate parameters, such as $H^2$ and $\alpha$, which are characteristic
of the underlying continuous distributions.

We now turn to the matter of reliability determinations for total test
scores obtained by summing pass-fail item scores. The hypothetical self-correlation,
$r_{TT'\Phi}$, may be defined as

$$r_{TT'\Phi} = \operatorname{cov}_\Phi(T, T')/\operatorname{var}_\Phi(T) = \sum_{i=1}^{n}\sum_{j=1}^{n}\operatorname{cov}_{\Phi ij} \Big/ \operatorname{var}_\Phi(T), \tag{23}$$

where the $\Phi$ subscripts are employed to indicate that dichotomous item
scores are employed, causing all interitem correlations to be phi coefficients,
$\Phi_{ij}$.

Use of the facts that

$$\sigma_{\Phi i} = \sqrt{P_i(1 - P_i)} \quad \text{and} \quad \sigma_{\Phi j} = \sqrt{P_j(1 - P_j)},$$

helps (23) become

$$r_{TT'\Phi} = \sum_{i=1}^{n}\sum_{j=1}^{n}\sqrt{P_i(1 - P_i)P_j(1 - P_j)}\,\Phi_{ij} \Big/ \operatorname{var}_\Phi(T). \tag{24}$$

This hypothetical self-correlation may be called a phi coefficient analogue of
$r_{TT'}$. Its value will be less than $r_{TT'}$ and, unlike $r_{TT'}$, it will be sensitive to
changes in $P_i$ from item to item.

To obtain a coefficient of equivalence, $H_\Phi^2$, for the dichotomous case,
any quantity attributable to specific factors in the underlying item scores
will be excluded from the numerator of (24). This is done by defining $H_\Phi^2$ as

$$H_\Phi^2 = \sum_{i=1}^{n}\sum_{j=1}^{n}\sqrt{P_i(1 - P_i)P_j(1 - P_j)}\,\Phi^*_{ij} \Big/ \operatorname{var}_\Phi(T), \tag{25}$$

where $\Phi^*_{ij} = \Phi_{ij}$ when $i \neq j$, and $\Phi^*_{ii} = \Phi(h_i^2)$ when $i = j$. $\Phi(h_i^2)$ is simply
the phi coefficient obtained by entering the Chesire, Saffir, and Thurstone [1]
tetrachoric correlation charts backwards, using $r_{ii} = h_i^2$ and $P_i$ to obtain
the fourfold table necessary for computation of the phi coefficient associated
with that $h_i^2$. $H_\Phi^2$ is not a measure of common factoredness, but factor structure
plus the $P_i$ and $P_j$ have complete control over it. It is the idealized test-retest
correlation for total test scores based on pass-fail data when the self-correlation
of items has been reduced to exclude specific factor contribution to the underlying
distribution. $H_\Phi^2$ may also be called the correlation between tests which
are matched item by item for common-factor structure.
An analogue of $\alpha$, $\alpha_\Phi$ or $r_{K-R}$, has already been mentioned:

$$\alpha_\Phi = r_{K-R} = \frac{n}{n-1}\sum_{i=1}^{n}\sum_{\substack{j=1 \\ j \neq i}}^{n}\sqrt{P_i(1 - P_i)P_j(1 - P_j)}\,\Phi_{ij} \Big/ \operatorname{var}_\Phi(T). \tag{26}$$

A comparison of (25) and (26) shows that $r_{K-R}$ is a coefficient of equivalence
when

$$P_i(1 - P_i)\,\Phi^*_{ii} = \sqrt{P_i(1 - P_i)P_j(1 - P_j)}\,\Phi_{ij}, \quad (i \neq j). \tag{27}$$

This condition is analogous to requirement (18), and both are equivalent
for the case of parallel tests to Jackson and Ferguson’s [6] assumption that
the mean interitem covariance between tests be equal to the mean interitem
covariance within tests.

In the special case where $P_i(1 - P_i)$ is constant over all $i$ and $\sigma_i$ is
also constant over all $i$, the one-to-one correspondence between $r_{ij}$ and $\Phi_{ij}$
will imply that the modified condition (20) following equation (22) is necessary
and sufficient for $r_{K-R}$ to be a coefficient of equivalence.
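Equations (25) and (26) can be illustrated for the case $P_i = .50$. The sketch below does not use the tetrachoric charts cited in the text. Instead it assumes bivariate normal underlying scores split at the median, for which the phi coefficient is $(2/\pi)\arcsin r$; chart-based values may therefore differ slightly in the third decimal. The function names are hypothetical, and the loadings are those of Population II in Table 1.

```python
import math


def phi_at_median_split(r):
    """Phi coefficient of two median-split variables with normal correlation r."""
    return 2 / math.pi * math.asin(r)


def dichotomous_coefficients(a):
    """Return (H_phi^2, r_KR) from (25) and (26), assuming every P_i = .50."""
    n, r = len(a), len(a[0])
    r_star = [[sum(a[i][m] * a[j][m] for m in range(r)) for j in range(n)]
              for i in range(n)]
    phi_star = [[phi_at_median_split(v) for v in row] for row in r_star]
    diagonal = sum(phi_star[i][i] for i in range(n))      # the Phi(h_i^2) terms
    off_diagonal = sum(sum(row) for row in phi_star) - diagonal
    # With P(1 - P) = .25 for every item the factor .25 cancels from each ratio.
    var_t = n + off_diagonal
    h_phi2 = (off_diagonal + diagonal) / var_t
    r_kr = n / (n - 1) * off_diagonal / var_t
    return h_phi2, r_kr


pop_II = [[0.8660, 0.0]] * 3 + [[0.0, 0.8660]] * 3
h_phi2_II, r_kr_II = dichotomous_coefficients(pop_II)
```

Under these assumptions Population II gives $H_\Phi^2$ of about .779 and $r_{K-R}$ of about .623, both well below the continuous-case $H^2$ of .900.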











The Estimation of $H^2$, $H_\Phi^2$, $\alpha$, and $r_{K-R}$

A crude estimation procedure for the four parameters, $H^2$, $H_\Phi^2$, $\alpha$,
and $r_{K-R}$, is to replace all individual components of (16), (25), (17), and
(26) by their estimators. Special problems arising in the application of (16)
and (17) when only dichotomous data are available are insoluble until some
assumption about the $\sigma_i$ and $\sigma_j$ values for the underlying continuous distribution
is made.

Obviously the $P_i$ and $P_j$ values of the multivariate dichotomous population
may be assigned independently of the $\sigma_i$ and $\sigma_j$ values for the underlying
continuous population. Then the pass-fail splits defining the $\Phi_{ij}$ will be
independent of the $\sigma_i$ and $\sigma_j$, depending only on the $P_i$, $P_j$, and $r_{ij}$. Thus
neither sample values nor parameter values from the dichotomous population
alone will give any information about the $\sigma_i$ and $\sigma_j$.

Any arbitrary assumption about the $\sigma_i$ and $\sigma_j$ would permit estimation
of $H^2$ and $\alpha$ with (16) and (17). In general the assumption $\sigma_i = \sigma$ for all $i$
leading to (21) and (22) seems most satisfactory. Therefore, it will be employed
throughout the remainder of this paper.

The estimation of $H^2$ and $\alpha$ from dichotomous data may be performed
for multivariate normally distributed underlying item scores, at least, by
replacing each population $r_{ij}$ in (21) and (22) by a tetrachoric correlation
coefficient obtained from the Chesire, Saffir, and Thurstone tables. Then $h_i^2$
is estimated as the highest tetrachoric coefficient of column $i$ in the tetrachoric
matrix. Improved estimation procedures would no doubt result from an
attempt to obtain maximum likelihood estimates for $H^2$, $H_\Phi^2$, $\alpha$, and $r_{K-R}$.

One may well ask the justification of estimating $H^2$ and $\alpha$ when sample
item scores are all dichotomous and the alleged continuous underlying
distributions seem but a convenient fiction. Three answers may be given
to this. (a) There is no other way to make a statement about the proportion
of common-factoredness in such tests. (b) The coefficients $H^2$ and $\alpha$ are
upper bounds on $H_\Phi^2$ and $r_{K-R}$, showing the degree of improvement in the
test which could be obtained by continuous rather than dichotomous scoring
of each item. With such materials as tests of physical strength, it may be
quite feasible to increase substantially the coefficient of equivalence by a
change in scoring methods without changing the test items employed. (c) The
sensitivity of $r_{K-R}$ to variation in $P_i$ is well known. Many attempts to control
test homogeneity have centered on control of the $P_i$. Comparisons between
$H^2$ and $H_\Phi^2$ and between $\alpha$ and $r_{K-R}$ will serve to emphasize that a test with
high $r_{K-R}$ because of homogeneous $P_i$ may nevertheless have less factorial
homogeneity than a test with a lower $r_{K-R}$.


A Sampling Study 

The unit of data for this study is a sextuplet of numbers representing 
item scores on a six-item ‘‘test’”’ given to a hypothetical “subject.” Eight 
samples of 500 such subjects’ item scores were obtained, one sample from 
each of eight populations defined by specifying the factor loadings of the 
items in the test associated with that population of scores. The factor loadings 
selected are presented in Table 1. 

TABLE 1

Item Factor Loadings and Test Parameter Values for Eight Contrived
Populations of Scores from Six-Item Tests

                            Loading for Item Number
Population   Loading      1      2      3      4      5      6
I            a_i1     .7746  .7746  .7746  .7746  .7746  .7746
II           a_i1     .8660  .8660  .8660      0      0      0
             a_i2         0      0      0  .8660  .8660  .8660
III          a_i1     .9045  .9045      0      0      0      0
             a_i2         0      0  .9045  .9045      0      0
             a_i3         0      0      0      0  .9045  .9045
IV           a_i1     .5884  .5884      0  .5884  .5884      0
             a_i2         0  .5884  .5884      0  .5884  .5884
             a_i3     .5884      0  .5884  .5884      0  .5884
V            a_i1     .8321  .8321      0  .8321  .8321      0
             a_i2         0  .4161  .4161      0  .4161  .4161
             a_i3     .4161      0  .4161  .4161      0  .4161
VI           a_i1     .5291  .5291  .5291  .5291  .5291  .5291
VII          a_i1     .3780  .3780  .3780  .3780  .3780  .3780
VIII         a_i1     .4472  .4472      0  .4472  .4472      0
             a_i2         0  .2236  .2236      0  .2236  .2236
             a_i3     .2236      0  .2236  .2236      0  .2236

Given the $a_{im}$ and $e_i$ values for any population, we employed a table of
random normal numbers ([4], Table 2) with $\mu = 0$, $\sigma = 1$, to obtain a
different set of 500 $x_{mp}$ or $\eta_{ip}$ values for each factor and for error. Equation
(1) was employed to obtain $s_{ip}$ values for each person on each item.

After all $s_{ip}$ values had been obtained for a test, the 500 scores for each
item were dichotomized on the basis of a pass-fail criterion. This permitted
determination of tetrachoric correlation coefficients for use in computing
$\hat{H}^2$ and $\hat{\alpha}$, estimators of $H^2$ and $\alpha$. The dichotomized scores were also employed
in calculating $\hat{H}_\Phi^2$ and $\hat{r}_{K-R}$. In the determination of $H_\Phi^2$ and $r_{K-R}$, the sample
proportions were used in place of population $P_i$, introducing some inaccuracy
in their values.

The dichotomization of item scores was performed twice, once with the
sample proportion of 1 scores, $p_i$, fixed at .50 for every item in every test
and once with $p_i$ variable from item to item. In the latter case $p_i$ ranged
from .25 to .75, with $p_1 = .25$, $p_2 = .35$, $\ldots$, $p_6 = .75$ for every test. Each
of the dichotomizations led to a distinct set of estimates of parameter values.
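The sampling procedure can be imitated in a few lines. This is only a rough sketch under stated assumptions: it substitutes Python's pseudorandom generator for the printed table of random normal numbers, sets all specific loadings $b_i$ to zero, dichotomizes each item at a sample quantile, and computes $r_{K-R}$ by the usual Kuder-Richardson formula 20. All function names are hypothetical.

```python
import random


def simulate_items(loadings, n_subjects, rng):
    """Draw standard scores from equation (1), assuming b_i = 0 for every item."""
    n_factors = len(loadings[0])
    scores = []
    for _ in range(n_subjects):
        x = [rng.gauss(0, 1) for _ in range(n_factors)]
        row = []
        for a_i in loadings:
            e_i = (1 - sum(v * v for v in a_i)) ** 0.5
            row.append(sum(a * xm for a, xm in zip(a_i, x)) + e_i * rng.gauss(0, 1))
        scores.append(row)
    return scores


def dichotomize(scores, pass_props):
    """Score the top p_i fraction of subjects on item i as 1 and the rest as 0."""
    n = len(scores)
    binary = [[0] * len(pass_props) for _ in range(n)]
    for i, p in enumerate(pass_props):
        ranked = sorted(range(n), key=lambda s: scores[s][i], reverse=True)
        for s in ranked[:round(p * n)]:
            binary[s][i] = 1
    return binary


def kr20(binary):
    """Kuder-Richardson formula 20 computed from 0/1 item scores."""
    n_subjects, n_items = len(binary), len(binary[0])
    totals = [sum(row) for row in binary]
    mean_t = sum(totals) / n_subjects
    var_t = sum((t - mean_t) ** 2 for t in totals) / n_subjects
    sum_pq = 0.0
    for i in range(n_items):
        p = sum(row[i] for row in binary) / n_subjects
        sum_pq += p * (1 - p)
    return n_items / (n_items - 1) * (1 - sum_pq / var_t)


rng = random.Random(1)                      # arbitrary seed for repeatability
scores = simulate_items([[0.7746]] * 6, 500, rng)
fixed = kr20(dichotomize(scores, [0.50] * 6))
variable = kr20(dichotomize(scores, [0.25, 0.35, 0.45, 0.55, 0.65, 0.75]))
```

Because the same underlying scores are dichotomized twice, the two values of `kr20` can be compared directly, in the manner of the fixed and variable $p_i$ conditions described above.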


Results 


Table 2 presents a comparison of sample and population values of all
four coefficients, both for the fixed $p_i$ case and the variable $p_i$ case. Within
the limits of the samples employed, our estimators are quite satisfactory.
The mean constant error of any estimator across eight populations never
exceeds .002 in absolute value, and no individual coefficient is in error by
more than .058.


TABLE 2

A Comparison of Parameter and Estimator Values for a Six-Item Synthetic Test

(H2 = $H^2$, a = $\alpha$, HP2 = $H_\Phi^2$, rKR = $r_{K-R}$; a trailing ^ marks a sample estimate.
A ? marks a digit, and ... an entry, that is illegible in the source.)

                           Fixed p_i                              Variable p_i
Pop.    H2     a     H2^    a^   HP2   HP2^   rKR   rKR^     H2^    a^   HP2   HP2^   rKR   rKR^
I     .900  .900    .929  .91?  .800  .840  .800  .822     .929  .905  .768   ...   ...   ...
II    .900  .720    .?85  .73?  .779  .789  .623  .623     .90?  .689  .7??  .768   ...   ...
III   .900  .540    .893  .579  .756  .7??  .454  .4??     .897  .50?  .751  .7??  .4??  .403
IV    .900  .810    .911  .838  .779  .801  .69?  .726     .890  .823  .766  .725  .661  .673
V     .900  .810    .?82  .801  .80?  .783  .713  .702     .89?  .?13  .787  .792  .668  .676
VI    .700  .700    .702  .680  .575  .575  .575  .549     .7??  .670  .5?9  .5?0  .5??  .527
VII   .500  .500    .558  .521  .3?9  .431  .389  .398     .587  .5?2  .362  .??0  .360  .356
VIII  .500  .450    .488   ...   ...   ...   ...  .336      ...   ...   ...   ...   ...   ...

In single-factored populations I, VI, and VII, $H^2$ and $\alpha$ are equal, and
$H_\Phi^2$ and $r_{K-R}$ are equal for the homogeneous $p_i$ case only. In some of the other
populations these coefficients differ markedly, indicating that $\alpha$ and $r_{K-R}$
are poor approximations to coefficients of equivalence in those cases. A
large discrepancy between sample values, $\hat{H}^2$ and $\hat{\alpha}$, or $\hat{H}_\Phi^2$ and $\hat{r}_{K-R}$, appears
indicative of multiple-factoredness or of wide dispersion of a single common
factor’s loadings. The converse statement is not true.

Unlike $H^2$ and $\alpha$, which are invariant under different dichotomizations
of the same underlying scores, $H_\Phi^2$ and $r_{K-R}$ are reduced by introducing
heterogeneity of the $p_i$. The estimators $\hat{H}_\Phi^2$ and $\hat{r}_{K-R}$ also show this effect.

A method based on configural analysis has been given whereby test
scoring techniques can be evaluated to see if they have optimal validity.
Configural analysis has also been used to show how three well known item
scoring techniques, multiple regression, total score, and multiple cut-off, imply
(for optimal validity) certain conditions on the answer pattern means. The
method is illustrated by a worked example.


The purpose of this article is to demonstrate a method whereby test 
scoring techniques can be evaluated to see if they have maximum validity. 
In a previous paper [3] a technique of pattern scoring of test items for the 
prediction of a quantitative criterion was presented. The basic notion used 
was that of a configural scale, defined as follows: given a test of $t$ items and
a quantitative criterion, form all possible answer patterns and assign to each
subject a score which is the mean criterion score for all subjects in his answer
pattern. This set of scores is called a configural scale. It was shown that, in
the analysis sample, of all possible ways of scoring the $t$ items, the configural
scale provides the best least squares prediction of the criterion. It was further 
shown that the configural scale could be represented exactly by a polynomial 
function of the item scores if the items are dichotomous. However, the concept 
of the configural scale is not restricted to dichotomous items. If the items 
are polychotomous, the only change is that the number of possible answer 
patterns will increase. 


Theory 
A. The equav model 


The configural scale is defined as the set of answer pattern means, and
can be represented by a polynomial function of the item scores. An example
of the configural scale and polynomial equation for two items is given in
Table 1. In Table 1 the answer patterns are designated by $A_0$ for the answer
pattern containing all yes responses, $A_1$ for the answer pattern containing a
no to Item 1 and a yes to all other items, $A_2$ for the answer pattern with a no
to Item 2 and a yes to all other items, and in general, $A_r$ where $r$ designates the
items to which the subject has responded no. The frequency and the criterion
mean of each answer pattern are designated by the same subscript; e.g., $n_0$
and $\bar{C}_0$ are the frequency and criterion mean, respectively, of $A_0$, the yes-yes
pattern.

*We are indebted to Professor James G. Taylor for his helpful suggestions.

TABLE 1

The Configural Scale and Equav Scores for Two Items

Answer    Items   Answer pattern   Criterion         Equav scores (M)
pattern           frequency (n)    means          X_0    X_1    X_2    X_12
A_0       YY      n_0              $\bar{C}_0$       1      1      1      1
A_1       NY      n_1              $\bar{C}_1$       1     -1      1     -1
A_2       YN      n_2              $\bar{C}_2$       1      1     -1     -1
A_12      NN      n_12             $\bar{C}_{12}$    1     -1     -1      1
Equav regression coefficients                     d_0    d_1    d_2    d_12

Each item is scored +1 for a yes response and −1 for a no response.
Let $u_k$ designate the score for the $k$th item. Let $X_j$ be a polynomial term,
where $j$ indicates the items which form the term. The score on $X_j$ is obtained
by actually multiplying the item scores together; e.g., $X_1 = u_1$, $X_{12} = u_1u_2$,
$X_{123} = u_1u_2u_3$, and so on. These polynomial terms are called equav scores.
As will be explained in detail later, equav is used to refer to the model for a
factorial analysis of variance with equal cell frequencies.

The matrix $M$ as shown in Table 1 is the set of equav scores whose
rows are the $2^t$ answer patterns and whose columns are the $2^t$ polynomial
terms. The general entry, $m_{rj}$, equals $(-1)^g$, where $g$ is the number of common
items in the $r$th answer pattern and the $j$th polynomial term. $X_0$ is a dummy
term with a score of +1 for every individual, to allow for the constant term
in the polynomial equation.
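The rule $m_{rj} = (-1)^g$ can be turned into a short construction of $M$ for any number of items. The sketch below is illustrative only; the function names and the ordering of patterns and terms (by number of items, then in numerical order) are assumptions chosen to match the layout of Table 1.

```python
from itertools import combinations


def subsets(t):
    """Item subsets of {1..t}: (), (1,), (2,), ..., (1, 2), ... in Table 1 order."""
    return [c for size in range(t + 1) for c in combinations(range(1, t + 1), size)]


def equav_matrix(t):
    """M[r][j] = (-1)**g, g = items shared by answer pattern r and term j.

    A row label lists the items answered no; a column label lists the items
    multiplied together in the polynomial term.
    """
    labels = subsets(t)
    return [[(-1) ** len(set(r) & set(j)) for j in labels] for r in labels]


M2 = equav_matrix(2)
# [[1, 1, 1, 1], [1, -1, 1, -1], [1, 1, -1, -1], [1, -1, -1, 1]]
```

For two items this reproduces the equav scores of Table 1, and the same function gives the 16 by 16 matrix needed for a four-item test.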

In Table 1 the equav predictor for the criterion score of the $i$th individual
is

$$\hat{C}_i = d_0X_{i0} + d_1X_{i1} + d_2X_{i2} + d_{12}X_{i12}, \tag{1}$$

where $\hat{C}_i$ is the predicted criterion score for the $i$th individual,
$X_{ij}$ is the score of the $i$th individual on the $j$th polynomial term,
$d_j$ is the least squares regression coefficient for the $j$th polynomial
term.

The exact solution for the $2^t$ regression coefficients can be obtained as
follows. Let $Z$ be an $N$ by $2^t$ matrix whose general element $z_{ij}$ is the equav
score of the $i$th individual on the $j$th polynomial term. $Z$ is an expanded
form of $M$ where the $r$th row of $M$ is repeated $n_r$ times. Let $C$ be an $N$-rowed
column vector, where $c_i$ is the criterion score of the $i$th individual. Let $d$
be the $2^t \times 1$ column vector of regression coefficients. Then

$$d = (Z'Z)^{-1}Z'C \tag{2}$$

is the set of regression coefficients which gives the exact least squares fit.
The predicted score $\hat{C}_i$ of the $i$th individual will be the mean of his answer
pattern.

Let $n$ be the diagonal matrix of answer pattern frequencies. Then

$$Z'Z = M'nM \tag{3}$$

and

$$Z'C = M'n\bar{C}, \tag{4}$$

where $\bar{C}$ is the $2^t$ by 1 column vector of answer pattern means. Substituting
(3) and (4) in (2),

$$d = (M'nM)^{-1}M'n\bar{C}. \tag{5}$$

Since $M = M'$ and $M^{-1} = 2^{-t}M$,

$$d = M^{-1}n^{-1}M'^{-1}M'n\bar{C} = M^{-1}\bar{C} = 2^{-t}M\bar{C}. \tag{6}$$

Thus, each regression coefficient is equal to the algebraic sum of the $2^t$
criterion averages divided by $2^t$.
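Equation (6) can be illustrated with a small numerical case. The four answer pattern means below are hypothetical, and the function names are not from the original paper; the sketch simply applies $d = 2^{-t}M\bar{C}$ and confirms that the fitted polynomial returns the pattern means.

```python
from itertools import combinations


def equav_matrix(t):
    """Equav score matrix M with entries (-1)**(number of shared items)."""
    labels = [c for size in range(t + 1)
              for c in combinations(range(1, t + 1), size)]
    return [[(-1) ** len(set(r) & set(j)) for j in labels] for r in labels]


def equav_coefficients(means, t):
    """Equation (6): d = 2**(-t) * M * (vector of answer pattern means)."""
    m = equav_matrix(t)
    size = 2 ** t
    return [sum(m[j][r] * means[r] for r in range(size)) / size
            for j in range(size)]


# Hypothetical means for the patterns A_0 (YY), A_1 (NY), A_2 (YN), A_12 (NN).
means = [12.0, 6.0, 4.0, 8.0]
d = equav_coefficients(means, 2)            # [7.5, 0.5, 1.5, 2.5]

# Predicted score for each pattern: the corresponding row of M times d.
m = equav_matrix(2)
fitted = [sum(m[r][j] * d[j] for j in range(4)) for r in range(4)]
```

The vector `fitted` equals `means`, in line with the statement that each predicted score is the mean of the subject's answer pattern.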

Scoring each item alternative +1 or —1 is exactly analogous to a two- 
level factorial analysis of variance model when all cells (answer patterns) 
have equal frequencies. For this reason the term equav is used to denote 
this method of scoring the polynomial terms. In previous papers [1, 3] the 
items have been scored +1 for a yes response and 0 for a no response. This 
leads to considerable difficulty if item scores are arbitrary. A reversal of an 
item score involves nonlinear transformations of the polynomial terms which 
alter the absolute values of the regression coefficients. 

Equav scoring has certain algebraic advantages (such as $M = M'$ and
$M^{-1} = 2^{-t}M$). The most important advantage is that the absolute values
of the regression coefficients are invariant no matter what item scores are
reversed. The proof follows:

Reverse the equav score on the $k$th item. Every −1 becomes +1, every
+1 becomes −1. In other words, $-u_k$ is substituted for $u_k$. This amounts
to multiplying each column of $M$ by −1 if $u_k$ appears in that polynomial
term. In general, reversing any set of item scores is equivalent to multiplying
the appropriate columns of $M$ by −1.

Let $H$ be a $2^t$ by $2^t$ diagonal matrix containing −1 in each diagonal
cell corresponding to the appropriate column of $M$. All other diagonals
contain +1. Let $P$ be the $M$ matrix after the item scores have been reversed.
Then $P = MH$. Let $e$ denote the set of regression coefficients for the polynomial
equation using the reversed scores.

Substituting in (5)

$$e = (P'nP)^{-1}P'n\bar{C}; \tag{7}$$

$$e = (H'M'nMH)^{-1}H'M'n\bar{C}; \tag{8}$$

$$e = H^{-1}M^{-1}n^{-1}M'^{-1}H'^{-1}H'M'n\bar{C}; \tag{9}$$

$$e = H^{-1}M^{-1}\bar{C}. \tag{10}$$

Since

$$H^{-1} = H \quad \text{and} \quad M^{-1} = 2^{-t}M, \tag{11}$$

$$e = H(2^{-t}M\bar{C}). \tag{12}$$

Therefore

$$e = Hd. \tag{13}$$

Premultiplying $d$ by $H$ simply reverses the sign of certain of the $d$ coefficients.
This proves that the absolute values of the equav coefficients are invariant
under reversal of item scores.
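The result $e = Hd$ can be checked numerically. In this sketch, which uses hypothetical means and function names, reversing item $k$ is modelled by relabelling the answer patterns: the subjects who formerly answered no to item $k$ now count as answering yes, and conversely.

```python
from itertools import combinations


def subsets(t):
    return [c for size in range(t + 1) for c in combinations(range(1, t + 1), size)]


def equav_coefficients(means, t):
    """d = 2**(-t) * M * means, with M[r][j] = (-1)**(shared items)."""
    labels = subsets(t)
    return [sum((-1) ** len(set(r) & set(j)) * c for r, c in zip(labels, means))
            / 2 ** t for j in labels]


def reverse_item(means, t, k):
    """Means indexed by the new pattern labels after reversing the key of item k."""
    labels = subsets(t)
    index = {label: i for i, label in enumerate(labels)}
    return [means[index[tuple(sorted(set(label) ^ {k}))]] for label in labels]


means = [12.0, 6.0, 4.0, 8.0]               # hypothetical two-item example
d = equav_coefficients(means, 2)            # [7.5, 0.5, 1.5, 2.5]
e = equav_coefficients(reverse_item(means, 2, 1), 2)

# Diagonal of H: -1 for every polynomial term that contains item 1.
h = [-1 if 1 in label else 1 for label in subsets(2)]
```

Here `e` is `[7.5, -0.5, 1.5, -2.5]`, which is `d` with the signs of the terms containing item 1 reversed.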

So far, only matters of algebraic and computational convenience have 
been discussed. However, it is possible that certain methods of scoring items 
may be most appropriate with certain kinds of test content; i.e., equav 
scoring may be most appropriate for personality tests, and zero-one scoring 
may be most appropriate to aptitude and achievement tests. 

Suppose that certain of the $2^t$ regression coefficients are zero in the
population. Then (6) does not give exact least square estimates of the nonzero
coefficients for the sample. A general solution for this case where certain
coefficients are assumed to be zero has been given in ([3], equation 38).

As an example, consider the special case of linearity where all the coefficients
for the nonlinear terms are zero. Then

$$\hat{C} = w_0 + w_1X_1 + w_2X_2 + w_3X_3 + w_4X_4. \tag{14}$$

Let $Z_1$ be the $N$ by $t + 1$ submatrix formed by taking the first $t + 1$ columns
of $Z$. Then

$$w_1 = (Z_1'Z_1)^{-1}Z_1'C, \tag{15}$$

where $w_1$ is the column of $t + 1$ linear regression coefficients. Let $K_1$ be the
$2^t$ by $t + 1$ submatrix formed by taking the first $t + 1$ columns of $M$. Then
$Z_1'Z_1 = K_1'nK_1$ and $Z_1'C = K_1'n\bar{C}$. Therefore

$$w_1 = (K_1'nK_1)^{-1}K_1'n\bar{C}. \tag{16}$$

Note that since $K_1$ is rectangular, no simple inverse exists and equation
(16) cannot be further simplified.
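Equation (16) is a frequency-weighted least squares fit and can be solved with ordinary elimination. The sketch below is a minimal illustration for two items; the frequencies and means are hypothetical, and the small linear solver is included only to keep the example self-contained.

```python
def solve(a, y):
    """Solve a*x = y by Gauss-Jordan elimination with partial pivoting."""
    n = len(y)
    aug = [row[:] + [y[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [v - f * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def linear_weights(k1, freqs, means):
    """Equation (16): w = (K1' n K1)^(-1) K1' n Cbar."""
    cols = len(k1[0])
    lhs = [[sum(f * row[p] * row[q] for f, row in zip(freqs, k1))
            for q in range(cols)] for p in range(cols)]
    rhs = [sum(f * row[p] * c for f, row, c in zip(freqs, k1, means))
           for p in range(cols)]
    return solve(lhs, rhs)


# First t + 1 = 3 columns of M for two items (terms X_0, X_1, X_2).
k1 = [[1, 1, 1], [1, -1, 1], [1, 1, -1], [1, -1, -1]]
freqs = [18, 42, 27, 63]                    # hypothetical pattern frequencies
means = [4.0, 2.0, 8.0, 6.0]                # exactly linear: 5 + 1*X_1 - 2*X_2
w = linear_weights(k1, freqs, means)
```

Because these hypothetical means are exactly linear in the item scores, the fit recovers the weights 5, 1, and −2 whatever the frequencies are; with real data the frequencies would influence the solution.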


B. Restrictions on the answer pattern means 


Any method of scoring the $t$ items which yields optimal validity in the
population and yet uses fewer than $2^t$ parameters imposes certain restrictions
on the population answer pattern means. In this paper, restrictions imposed 
by three well known test scoring methods: multiple regression, total score, 
and multiple cut-off, are considered. 

Table 2 summarizes the necessary and sufficient conditions (in the 
mathematical sense) for each of the three scoring techniques to yield optimal 
validity. The restrictions on the equav coefficients amount to definitions in 
the case of multiple regression and total score. From these definitions a 
number of restrictions on the answer pattern means can be derived. 


TABLE 2

Conditions for Optimal Validity

                                        Scoring Method
                 Multiple Regression       Total Score                   Multiple Cut-off
Equav            All non-linear            1. All non-linear             Only one coefficient
Coefficients     coefficients are zero        coefficients are zero      differs from the others
                                           2. All first-order            in absolute value
                                              coefficients are equal

Answer           The sum of complementary  1. The sum of complementary   Only one mean differs
Pattern Means    answer pattern means is      answer pattern means is    from the others.
                 equal to a constant,         equal to a constant,
                 2d_0                         2d_0
                                           2. All answer patterns whose
                                              sums of item scores are
                                              equal have equal means.





One of the most useful restrictions on the answer pattern means in
the linear case is given in Table 2. First, let us define complementary answer
patterns. Answer pattern $A_r$ is complementary to $A_{r'}$ if and only if every
item response in $A_r$ is reversed for $A_{r'}$. For example $A_{12}$ (NNYY) is the
complement of $A_{34}$ (YYNN). In our four-item case

$$\bar{C}_r = d_0 + d_1X_1 + d_2X_2 + d_3X_3 + d_4X_4, \tag{17}$$

$$\bar{C}_{r'} = d_0 + d_1X_1' + d_2X_2' + d_3X_3' + d_4X_4'. \tag{18}$$

For the equav model, $X_1 + X_1' = X_2 + X_2' = X_3 + X_3' = X_4 + X_4' = 0$.
In general $(X + X') = 0$. Therefore,

$$\bar{C}_r + \bar{C}_{r'} = 2d_0 + (X + X')(d_1 + d_2 + d_3 + d_4) = 2d_0. \tag{19}$$


The additional restriction for the total score case, that all answer patterns
with the same total score have equal means, can be derived as follows: By
definition, the first-order coefficients are equal, i.e.,

$$d_1 = d_2 = d_3 = d_4 = d.$$

Therefore,

$$\bar{C}_r = d_0 + d_1X_1 + d_2X_2 + d_3X_3 + d_4X_4 = d_0 + d(X_1 + X_2 + X_3 + X_4).$$

Thus all answer patterns with the same sum of item scores $(X_1 + X_2 +
X_3 + X_4)$ will have the same mean.

The basic definition of the multiple cut-off is that only two scores are
used. The subjects in the all-yes answer pattern are assigned one score; the
subjects in all other answer patterns are assigned the other score. This
scoring method implies that for optimal validity all except one of the answer
pattern means should be equal. Without losing generality, it can be assumed
that the unique mean is $\bar{C}_0$. Let $\bar{C}$ denote the constant mean for all other
answer patterns. From (6) for calculating the regression coefficients from the
means it follows that

$$d_0 = [\bar{C}_0 + (2^t - 1)\bar{C}]/2^t, \tag{21}$$

$$d_1 = (\bar{C}_0 - \bar{C})/2^t, \tag{22}$$

$$d_2 = (\bar{C}_0 - \bar{C})/2^t, \tag{23}$$

and in general

$$d_j = (\bar{C}_0 - \bar{C})/2^t. \tag{24}$$

Thus, in the multiple cut-off case, all coefficients but one will have the same
absolute value.
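Equations (21) through (24) follow from (6) and can be confirmed with a small case. The sketch assumes a three-item test in which the all-yes pattern has a mean of 20 and every other pattern a mean of 4; these numbers and the function name are hypothetical.

```python
from itertools import combinations


def equav_coefficients(means, t):
    """Equation (6): d = 2**(-t) * M * means, with M[r][j] = (-1)**(shared items)."""
    labels = [c for size in range(t + 1)
              for c in combinations(range(1, t + 1), size)]
    return [sum((-1) ** len(set(r) & set(j)) * c for r, c in zip(labels, means))
            / 2 ** t for j in labels]


t = 3
c_unique, c_other = 20.0, 4.0
means = [c_unique] + [c_other] * (2 ** t - 1)   # all-yes pattern listed first
d = equav_coefficients(means, t)
```

The constant term is $[20 + 7(4)]/8 = 6$ and every other coefficient is $(20 - 4)/8 = 2$, as (21) and (24) require.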


C. The F ratio tests 


How any hypothesized relation of a specific set of item interactions to
the criterion can be tested by means of the F ratio is shown in [3]. Of course,
the usual assumptions of normality and homogeneity of variance must be
met.

The general F ratio test is as follows: let $\eta$ be the configural validity,
$r$ be the validity of any specified scoring method, and $v$ be the number of
sample statistics that must be calculated. Then


(25) $F = \left(\dfrac{\eta^2 - r^2}{1 - \eta^2}\right)\left(\dfrac{N - 2^t}{2^t - v}\right)$.


An even more general formula is given in ([3], equation 34).
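
The ratio in (25) can be sketched in a few lines of Python. The function name and argument names below are illustrative assumptions, not notation from the paper; the numbers in the usage line are the rounded values quoted in Step 8 of the worked example.

```python
def f_ratio(eta_sq, r_sq, n_subjects, n_items, n_stats):
    """F ratio of (25): configural validity against one scoring method.

    eta_sq  : squared configural validity
    r_sq    : squared validity of the scoring method under test
    n_stats : number of sample statistics the scoring method requires
    """
    n_patterns = 2 ** n_items
    df_num = n_patterns - n_stats        # numerator degrees of freedom
    df_den = n_subjects - n_patterns     # denominator degrees of freedom
    f = ((eta_sq - r_sq) / (1.0 - eta_sq)) * (df_den / df_num)
    return f, df_num, df_den


# Rounded values from the multiple cut-off comparison (Step 8).
f, df1, df2 = f_ratio(eta_sq=0.612, r_sq=0.060, n_subjects=500, n_items=4, n_stats=2)
print(round(f, 2), df1, df2)
```

With these rounded inputs the result agrees with the reported 49.185 to about two decimal places, on 14 and 484 degrees of freedom.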





Worked Example 


In order to illustrate the method an example was constructed. Five
hundred scores were drawn from a table of normal random deviates. These
scores were then transformed so that the universe mean was 5 and the
universe standard deviation was 10. The frequencies for each answer pattern
were calculated by fixing the p values of the four items at $p_1 = .3$, $p_2 = .4$,
$p_3 = .5$, $p_4 = .6$ and assuming all items to be statistically independent.
The 500 scores were assigned at random to the sixteen answer patterns
according to predetermined frequencies.
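
The predetermined frequencies follow from the independence assumption: each pattern's frequency is 500 times the product of the item probabilities. A minimal sketch (the function and variable names are assumptions made for illustration):

```python
from itertools import product

N_SUBJECTS = 500
P_YES = [0.3, 0.4, 0.5, 0.6]   # p values of items 1-4 as given in the text


def pattern_frequencies(p_yes, n_subjects):
    """Expected frequency of every yes/no answer pattern under independence."""
    freqs = {}
    for pattern in product("YN", repeat=len(p_yes)):
        prob = 1.0
        for answer, p in zip(pattern, p_yes):
            prob *= p if answer == "Y" else 1.0 - p
        freqs["".join(pattern)] = round(n_subjects * prob)
    return freqs


freqs = pattern_frequencies(P_YES, N_SUBJECTS)
print(freqs["YYYY"], freqs["NYYY"], freqs["NNYY"])
```

These frequencies are the n column of Table 3 (for example, 18 subjects for YYYY and 63 for NNYY).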

To the artificial data described above, a linear systematic component
was added. The following arbitrary values were assigned: $d_0 = 22$, $d_1 = -1$,
$d_2 = 5$, $d_3 = -7$, $d_4 = 9$. Using these values in (14) gave the systematic
component that was added to each answer pattern mean. The column in
Table 3 labelled $\bar{C}$ gives the answer pattern mean obtained by these
computations.


TABLE 3


Basic Data for Worked Example


Answer Pattern     n    $\Sigma C$   $\Sigma C^2$   $(\Sigma C)^2/n$   Deviance   $\bar{C}$      d     $\hat{C}$   T   M
A_0     YYYY      18      605     22,581    20,334.722    2,246.277   33.611   26.423   33.465   2   0
A_1     NYYY      42    1,556     61,114    57,646.095    3,467.905   37.048   -1.523   36.397   3   0
A_2     YNYY      27      604     17,448    13,511.703    3,936.296   22.370    5.145   23.011   1   0
A_3     YYNY      18      861     42,275    41,184.500    1,090.500   47.833   -6.230   45.745   3   1
A_4     YYYN      12      168      3,836     2,352.000    1,484.000   14.000    9.541   14.679   1   0
A_12    NNYY      63    1,725     52,863    47,232.142    5,630.857   27.381    -.121   25.943   2   0
A_13    NYNY      42    1,997    100,023    94,952.595    5,070.405   47.548    -.007   48.677   4   0
A_14    NYYN      28      505     11,821     9,108.035    2,712.964   18.036     .282   17.611   2   0
A_23    YNNY      27      947     35,765    33,215.148    2,549.852   35.074     .336   35.291   2   0
A_24    YNYN      18       84      1,972       392.000    1,580.000    4.667     .402    4.225   0   0
A_34    YYNN      12      291      7,813     7,056.750      756.250   24.250     .369   26.959   2   0
A_123   NNNY      63    2,321     91,363    85,508.587    5,854.413   36.841    -.217   38.223   3   0
A_124   NNYN      42      186      5,738       823.714    4,914.286    4.429     .574    7.157   1   0
A_134   NYNN      28      846     27,548    25,561.285    1,986.714   30.214    -.863   29.891   3   0
A_234   YNNN      18      313      6,495     5,442.722    1,052.278   17.389    -.656   16.505   1   0
A_1234  NNNN      42      927     24,471    20,460.214    4,010.786   22.071     .157   19.437   2   0

Total            500   13,936    513,126   388,424.192  124,701.808   27.872   33.612  423.216


$\Sigma\hat{C} = 13,935.300$    $\Sigma\hat{C}^2 = 463,601.606$    $\Sigma C\hat{C} = 463,603.728$

$\Sigma M = 18$    $\Sigma M^2 = 18$    $\Sigma MC = 861$    $\Sigma T = 1,100$    $\Sigma T^2 = 2,890$    $\Sigma TC = 36,011$


Step 1—Calculation of the configural validity


First, a one-way analysis of variance is computed. The column labelled
deviance in Table 3 contains the sum of squared deviations (sum of squares)
about each answer pattern mean. The sum of the 16 answer pattern deviances
is W, the within group sum of squares. The column labelled $(\Sigma C)^2/n$
contains a correction term for each answer pattern. The sum of the 16
correction terms minus the correction for the total equals B, the between group
sum of squares.

TABLE 4


Analysis of Variance


Source                      df      Deviance     Mean Square
Between Answer Patterns     15     76,358.020      5,090.535
Within                     484     48,343.784         99.884
Total                      499    124,701.804

$\eta^2 = .612$    $F = 50.964$


These figures along with T, the deviance (sum of squares) about the
total mean, are given, in the usual analysis of variance form, in Table 4.
The formula for the configural validity is


(26) $\eta^2 = B/T = \dfrac{76,358.020}{124,701.804} = .612325$.
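
The analysis of variance can be retraced from the n and $\Sigma C$ columns of Table 3 together with the total sum of squared criterion scores. The sketch below is one way to do it; the function and variable names are assumptions for illustration.

```python
# n and sum of criterion scores for the 16 answer patterns, in Table 3 order.
N_R = [18, 42, 27, 18, 12, 63, 42, 28, 27, 18, 12, 63, 42, 28, 18, 42]
SUM_C = [605, 1556, 604, 861, 168, 1725, 1997, 505,
         947, 84, 291, 2321, 186, 846, 313, 927]
SUM_C_SQ_TOTAL = 513126   # sum of squared criterion scores over all subjects


def configural_anova(n_r, sum_c, sum_c_sq_total):
    """One-way analysis of variance over answer patterns."""
    n_total = sum(n_r)
    correction = sum(sum_c) ** 2 / n_total            # correction for the total
    between = sum(s ** 2 / n for s, n in zip(sum_c, n_r)) - correction
    total = sum_c_sq_total - correction
    within = total - between
    df_between = len(n_r) - 1
    df_within = n_total - len(n_r)
    eta_sq = between / total                          # configural validity, (26)
    f = (between / df_between) / (within / df_within)
    return between, within, total, eta_sq, f


between, within, total, eta_sq, f = configural_anova(N_R, SUM_C, SUM_C_SQ_TOTAL)
print(round(between, 2), round(within, 2), round(eta_sq, 4), round(f, 2))
```

Up to rounding, this reproduces the entries of Table 4.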





The test of significance is


(27) $F = \left(\dfrac{\eta^2}{1 - \eta^2}\right)\left(\dfrac{N - 2^t}{2^t - 1}\right) = \left(\dfrac{.612325}{.387675}\right)\left(\dfrac{484}{15}\right) = 50.964$.

Since the .001 confidence level is 2.577, the configural validity is obviously
greater than zero. If the F ratio were not significant, the analysis would be
stopped, for then no method of scoring the test would give a better-than-
chance prediction of the criterion scores.





Step 2—Calculation of the polynomial regression coefficients 

In Table 3, the column labelled d contains all 16 polynomial regression
coefficients. Each coefficient was computed by adding together the 16 means
(appropriately signed) and dividing by $2^t$. The sign of each mean is given
by $(-1)^g$, where g is the number of common items in the subscripts of the
regression coefficient and the mean.

For example,


$d_0 = \dfrac{1}{2^4}[\bar{C}_0 + \bar{C}_1 + \bar{C}_2 + \bar{C}_3 + \bar{C}_4 + \bar{C}_{12} + \bar{C}_{13} + \bar{C}_{14} + \bar{C}_{23} + \bar{C}_{24} + \bar{C}_{34} + \bar{C}_{123} + \bar{C}_{124} + \bar{C}_{134} + \bar{C}_{234} + \bar{C}_{1234}] = \dfrac{422.762}{16} = 26.423$,


$d_1 = \dfrac{1}{2^4}[\bar{C}_0 - \bar{C}_1 + \bar{C}_2 + \bar{C}_3 + \bar{C}_4 - \bar{C}_{12} - \bar{C}_{13} - \bar{C}_{14} + \bar{C}_{23} + \bar{C}_{24} + \bar{C}_{34} - \bar{C}_{123} - \bar{C}_{124} - \bar{C}_{134} + \bar{C}_{234} - \bar{C}_{1234}] = \dfrac{-24.374}{16} = -1.523$,


$d_{12} = \dfrac{1}{2^4}[\bar{C}_0 - \bar{C}_1 - \bar{C}_2 + \bar{C}_3 + \bar{C}_4 + \bar{C}_{12} - \bar{C}_{13} - \bar{C}_{14} - \bar{C}_{23} - \bar{C}_{24} + \bar{C}_{34} + \bar{C}_{123} + \bar{C}_{124} - \bar{C}_{134} - \bar{C}_{234} + \bar{C}_{1234}] = \dfrac{-1.930}{16} = -.121$,





and so on. 
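
The sign rule lends itself to a short program. The sketch below keys each answer pattern mean by the items answered "no" (the subscript convention of Table 3) and applies $(-1)^g$ to every mean; the names `MEANS` and `equav_coefficients` are assumptions introduced for illustration.

```python
from itertools import combinations

ITEMS = (1, 2, 3, 4)

# Answer pattern means from Table 3, keyed by the subscript of the pattern.
MEANS = {
    (): 33.611, (1,): 37.048, (2,): 22.370, (3,): 47.833, (4,): 14.000,
    (1, 2): 27.381, (1, 3): 47.548, (1, 4): 18.036, (2, 3): 35.074,
    (2, 4): 4.667, (3, 4): 24.250, (1, 2, 3): 36.841, (1, 2, 4): 4.429,
    (1, 3, 4): 30.214, (2, 3, 4): 17.389, (1, 2, 3, 4): 22.071,
}


def equav_coefficients(means, items):
    """All 2^t polynomial regression coefficients from the pattern means."""
    subsets = [s for k in range(len(items) + 1) for s in combinations(items, k)]
    coeffs = {}
    for coef in subsets:
        signed_sum = 0.0
        for pattern in subsets:
            g = len(set(coef) & set(pattern))   # items common to both subscripts
            signed_sum += (-1) ** g * means[pattern]
        coeffs[coef] = signed_sum / len(subsets)
    return coeffs


d = equav_coefficients(MEANS, ITEMS)
print(round(d[()], 3), round(d[(1,)], 3), round(d[(1, 2)], 3))
```

The three printed values correspond to $d_0$, $d_1$, and $d_{12}$ in the examples above; the remaining entries of `d` give the rest of the d column.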
These coefficients can be scanned in order to see which scoring method
seems to give an optimal prediction of the criterion with the fewest parameters.
For example, if, in the population, the relation between the items and the
criterion were exactly linear, the only nonzero coefficients would be $d_0$, $d_1$,
$d_2$, $d_3$, and $d_4$. In any actual sample, the other nonlinear coefficients would
be small but not exactly zero. So one can simply look at the five linear
coefficients to see if their absolute values are larger than any of the other
coefficients. Another such test is to check the frequency of negative values
among the eleven nonlinear coefficients. In the linear case, the true probability
of a negative value is 1/2. If these crude combinatorial tests do not contradict
the hypothesis of linearity we proceed to the next possibility, that the total
score will give maximum validity.

The total score will give maximum validity in the population when all
the conditions for linearity are met and in addition, $|d_1| = |d_2| = |d_3| = |d_4|$;
i.e., the absolute values of the first-order coefficients are equal. Again,
a crude check of this can be made in the sample by seeing if there is wide
variation among the absolute values of the first-order coefficients.

In Table 3 it can be seen that the linear coefficients ($d_0$, $d_1$, $d_2$, $d_3$,
and $d_4$) meet the first condition; their absolute values are larger than those
of the nonlinear coefficients. Also the second condition is met; the ratio of
negative nonlinear coefficients was 4/11, which is not significantly different
from the expected value of 1/2. Therefore, the hypothesis of linearity is not
contradicted.

Next the first-order coefficients were examined to see if the total score
was likely to have maximum validity. If so, the first-order coefficients ($d_1$,
$d_2$, $d_3$, and $d_4$) would be approximately equal in absolute value. But $d_4$ was
more than six times the absolute value of $d_1$. So it is unlikely that total score
would have maximum validity.

As mentioned earlier the above crude tests can be used if the research
worker has no definite hypothesis about the optimal scoring method. However,
to demonstrate conclusively that linear scoring is sufficient for optimal
prediction it is necessary to show that the multiple correlation, $R_{c.1,2,3,4}$,
does not differ significantly from the configural validity $\eta$. To demonstrate
conclusively that linear scoring is preferable to the total score and the multiple
cut-off score, it has to be shown that the total score validity and the multiple
cut-off validity are significantly less than the configural validity.


Step 3—Calculation of the multiple correlation 


In Step 2, the linear hypothesis passed the first crude tests. To compute
the linear multiple correlation $R_{c.1,2,3,4}$, first $\hat{C}_r$, the predicted criterion
score for the rth answer pattern, was calculated. (This may not be the most
convenient method, but it does show any large deviations from the linear
hypothesis. For a perfect linear fit, $\hat{C}_r = \bar{C}_r$.) To obtain $\hat{C}$ it was first
necessary to compute the linear regression coefficients; i.e., $w_0$, $w_1$, $w_2$, $w_3$, $w_4$
from (16).

The matrices $(K_l' n K_l)$, $(K_l' n K_l)^{-1}$ and $(K_l' n \bar{C})$ are presented in Table 5.
The regression coefficients are in the column w. Then $\hat{C}$ was obtained by
applying the equation $\hat{C} = K_l w$. The predicted criterion means are presented
in the $\hat{C}$ column of Table 3. $R_{c.1,2,3,4}$ is equal to $r_{C\hat{C}}$, the zero-order correlation
between C and $\hat{C}$.


TABLE 5


Computation of Linear Regression Coefficients


              $K_l' n K_l$                      $1000(K_l' n K_l)^{-1}$            $K_l' n \bar{C}$       w
       X_0   X_1   X_2   X_3   X_4      X_0     X_1     X_2     X_3     X_4
X_0    500  -200  -100     0   100     2.548    .952    .417      0    -.417     13,936   26.451
X_1   -200   500    40     0   -40      .952   2.381      0       0       0      -6,190   -1.466
X_2   -100    40   500     0   -20      .417      0    2.083      0       0        -278    5.227
X_3      0     0     0   500     0        0       0       0    2.000      0      -3,070   -6.140
X_4    100   -40   -20     0   500     -.417      0       0       0    2.083      7,296    9.393
Sum    300   300   420   500   540     3.500   3.333   2.500   2.000   1.666     11,694   33.465


This was computed by the well known formula


(28) $r_{C\hat{C}}^2 = \dfrac{(N\Sigma C\hat{C} - \Sigma C\,\Sigma\hat{C})^2}{[N\Sigma C^2 - (\Sigma C)^2][N\Sigma\hat{C}^2 - (\Sigma\hat{C})^2]}$.

Column $\Sigma C$ in Table 3 contains the sums of the criterion scores for
each answer pattern. To obtain $\Sigma C\hat{C}$, each $\hat{C}_r$ was multiplied by the $\Sigma C$
for the rth answer pattern, and the result was summed over all patterns.


Similarly,


$\Sigma\hat{C} = \sum_{r=1}^{2^t} n_r \hat{C}_r$ and $\Sigma\hat{C}^2 = \sum_{r=1}^{2^t} n_r \hat{C}_r^2$.

Substituting in (28) from the data in Table 3,

$r_{C\hat{C}}^2 = \dfrac{[500(463,603.728) - 13,936(13,935.300)]^2}{[500(513,126) - (13,936)^2][500(463,601.606) - (13,935.300)^2]} = .603$.
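
Formula (28) is the ordinary squared product-moment correlation computed from raw sums, so a single helper covers the linear case and, with the sums printed beneath Table 3, the total score and multiple cut-off scores taken up in Steps 5 and 7. The helper name and dictionary layout below are illustrative assumptions.

```python
def squared_correlation(n, sum_x, sum_y, sum_xx, sum_yy, sum_xy):
    """Squared product-moment correlation from raw sums, as in (28)."""
    cov = n * sum_xy - sum_x * sum_y
    var_x = n * sum_xx - sum_x ** 2
    var_y = n * sum_yy - sum_y ** 2
    return cov ** 2 / (var_x * var_y)


N = 500
SUM_C, SUM_C_SQ = 13936, 513126          # criterion sums from Table 3

# (sum of scores, sum of squared scores, sum of score x criterion)
SCORES = {
    "linear regression": (13935.300, 463601.606, 463603.728),
    "total score": (1100, 2890, 36011),
    "multiple cut-off": (18, 18, 861),
}

validities = {
    name: squared_correlation(N, SUM_C, s, SUM_C_SQ, ss, sc)
    for name, (s, ss, sc) in SCORES.items()
}
for name, r_sq in validities.items():
    print(f"{name}: {r_sq:.3f}")
```

Rounded to three places, the three squared validities are .603, .489, and .060.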





Step 4—Comparison of the multiple correlation with the configural validity 
Applying (25),


$F = \left(\dfrac{.612 - .603}{.388}\right)\left(\dfrac{500 - 16}{16 - 5}\right)$.

Since, for 11 and 484 d.f., the .05 level is 1.750, the F test indicates that the
multiple correlation does not differ significantly from the configural validity;
i.e., the multiple regression scoring method yields optimal validity.





Step 5—Calculation of the total score validity 


In order to rule out conclusively the total score hypothesis, the zero-
order correlation $r_{TC}$ was computed between the total score and the criterion.
In general, the total score, T, is equal to the number of yes responses for all
items with positive first-order coefficients plus the number of no responses
for all items with negative first-order coefficients. The column labelled T
in Table 3 gives the total score for each answer pattern. The usual formula
for the squared correlation was used.
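
The scoring rule for T can be written out directly. In the sketch below the signs of the first-order coefficients are read from the d column of Table 3 ($d_1$ and $d_3$ negative, $d_2$ and $d_4$ positive); the function name and the frequency dictionary are assumptions for illustration.

```python
# Signs of the first-order coefficients d_1..d_4 from the d column of Table 3.
FIRST_ORDER_SIGNS = [-1, +1, -1, +1]

# Pattern frequencies (n column of Table 3).
TABLE3_N = {
    "YYYY": 18, "NYYY": 42, "YNYY": 27, "YYNY": 18, "YYYN": 12, "NNYY": 63,
    "NYNY": 42, "NYYN": 28, "YNNY": 27, "YNYN": 18, "YYNN": 12, "NNNY": 63,
    "NNYN": 42, "NYNN": 28, "YNNN": 18, "NNNN": 42,
}


def total_score(pattern, signs):
    """Yes answers on positively keyed items plus no answers on negative ones."""
    return sum(
        1 for answer, sign in zip(pattern, signs)
        if (answer == "Y") == (sign > 0)
    )


sum_t = sum(n * total_score(p, FIRST_ORDER_SIGNS) for p, n in TABLE3_N.items())
sum_t_sq = sum(n * total_score(p, FIRST_ORDER_SIGNS) ** 2 for p, n in TABLE3_N.items())
print(sum_t, sum_t_sq)
```

The two sums agree with $\Sigma T = 1,100$ and $\Sigma T^2 = 2,890$ given beneath Table 3.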


$r_{TC}^2 = \dfrac{(N\Sigma CT - \Sigma C\,\Sigma T)^2}{[N\Sigma C^2 - (\Sigma C)^2][N\Sigma T^2 - (\Sigma T)^2]}$.

Using the data from Table 3,

$r_{TC}^2 = \dfrac{[500(36,011) - (13,936)(1,100)]^2}{[500(513,126) - (13,936)^2][500(2,890) - (1,100)^2]} = .489$.

Step 6—Comparison of the total score validity with the configural validity 
Applying (25),


$F = \left(\dfrac{.612 - .489}{.388}\right)\left(\dfrac{500 - 16}{16 - 2}\right)$.


Since, for 14 and 484 d.f., the .05 level is 1.690, the F test indicates that
the total score validity is significantly less than the configural validity,
i.e., the total score does not yield optimal validity.


Step 7—Calculation of the multiple cut-off validity 


In order to rule out conclusively the multiple cut-off hypothesis, the
zero-order correlation, $r_{MC}$, was computed between the multiple cut-off score
and the criterion. Multiple cut-off scoring demands that a score of one be
assigned to the answer pattern with the highest (or lowest) mean and that a
score of zero be assigned to all other answer patterns. Column M in Table
3 gives the multiple cut-off score for each answer pattern. Substituting
figures from Table 3 into the formula for squared correlation,


$r_{MC}^2 = \dfrac{[500(861) - 18(13,936)]^2}{[500(513,126) - (13,936)^2][500(18) - (18)^2]} = .060$.





Step 8—Comparison of the multiple cut-off validity with the configural validity 
Applying (25),


$F = \left(\dfrac{.612 - .060}{.388}\right)\left(\dfrac{500 - 16}{16 - 2}\right) = 49.185$.


Since, for 14 and 484 d.f., the .05 confidence level is 1.690, the F test
shows that the multiple cut-off validity is significantly less than the configural
validity.


Discussion 


It has been shown how the concept of the configural scale can be used 
to give an exact statistical test of whether a selected scoring technique 
has optimal validity. Worked examples have been given for three well known 
test scoring methods: multiple regression, multiple cut-off, and total score. 
In general, the principal advantage of configural analysis is that all of the 
information concerning the subject’s test behavior is utilized. 

On the other hand, the principal disadvantages of configural analysis
lie in the very fact that all the information is conserved; i.e., all possible
answer patterns are considered. In a t-item test the formula for the configural
validity involves $2^t$ parameters; i.e., $2^t$ answer pattern averages. It is
immediately obvious that this technique is only appropriate for situations
where the number of items is very small compared to the number of subjects:
N must be much greater than $2^t$. For example, even when the number of
items is as small as 10, $2^t$ will be 1024.

Use of the equav coefficients for scanning purposes introduces another 
difficulty. The F ratio test no longer gives the exact confidence level; it is 
simply a decision function. The procedure of selecting the test scoring method 
which is most likely to yield optimal validity alters the significance level of
the F test (cf. [2], p. 199 ff.). As a way of deciding among several possible
test scoring methods, the scanning technique is certainly a reasonable pro- 
cedure. However, it is advisable, after selecting a test scoring method on one 
sample, to cross-validate it on another sample. 

Configural analysis is most suitable in situations where testing time
is short and the number of subjects is large. For example, take the case of
neuropsychiatric screening in the armed forces where often only a few minutes
of testing time is available, and a very large number of subjects must be
screened. Here, items should be constructed in such a way that all $2^t$ regression
coefficients are significant. This will give maximum discrimination. However,
in actual practice some of the regression coefficients will probably be
nonsignificant. If this occurs, a value of zero should be given to all nonsignificant
coefficients. The use of any other values will lower the validity of the test.


A MODEL FOR RESPONSE TENDENCY COMBINATION


DAVID BIRCH


UNIVERSITY OF MICHIGAN


A model is proposed to predict the performance on a compound stimulus 
as a function of the performance on the component stimuli in a two-choice 
situation. Data from a learning task are used to evaluate the model. 


Any theory of behavior which analyzes a stimulus complex into com- 
ponents and attempts to account for responses to the complex on the basis 
of response tendencies to the components faces the problem of specifying 
the rule for the combination of the component response tendencies. Theorists 
such as Hull [4], Thurstone [8], Gulliksen [3], Spence [7], Estes and Burke [2], 
Bush and Mosteller [1], and Restle [5] have incorporated combination rules 
within their theories and then made use of them in deriving implications 
from their theories. Seldom, however, has the combination rule itself been 
the focus of attention for these theorists. One recent instance is a study by 
Schoeffler [6] who carried out a test of a combination rule derived from the 
Estes-Burke learning theory. The rule is linked directly to the parameters 
of the theory and certain assumptions about the parameters are made by 
Schoeffler in bringing the rule to test. 

This paper presents the development of a model for combining response 
tendencies in a two-choice situation and reports a test for the fit of the model 
to data. The basis for the definition of the parameters of the model proposed, 
as well as the impetus for the development of the model, are derived from 
Hullian behavior theory. However, the combination rule, specified by the 
interrelationships of the parameters of the model, does not depend upon any 
particular learning theory and, therefore, may be of value in a variety of 
situations where problems of combination arise. 


A Model for Response Tendency Combination 


The difference in response tendency strength for stimulus a, $D_a$ (the
strength of the tendency to respond u minus the strength of the tendency to
respond v), at any given point in time will be considered to be in one of
three states: $D_a \ge d$, $D_a \le -d$, or $-d < D_a < d$, where d is a parameter
with a value such that $\Pr\{u \mid D_a \ge d\} = 1$, $\Pr\{u \mid D_a \le -d\} = 0$, and
$\Pr\{u \mid -d < D_a < d\} = .5$. Let $\Pr\{D_a \ge d\}$, $\Pr\{D_a \le -d\}$, and
$\Pr\{-d < D_a < d\} = 1 - \Pr\{D_a \ge d\} - \Pr\{D_a \le -d\}$ be the probabilities that
the difference in response tendency strengths is in each of the three states.

*The initial work on the model was carried out under the Summer Faculty Research
Fellowship Program of the Horace H. Rackham School of Graduate Studies at the
University of Michigan. Acknowledgment is also due Mr. Richard Anderson for his assistance
in data collection.

It then follows that the compound probabilities of obtaining u and v
when a is presented may be written as


(1) $\Pr\{u \mid a\} = \Pr\{D_a \ge d\} + (.5)[1 - \Pr\{D_a \ge d\} - \Pr\{D_a \le -d\}]$

and

(2) $\Pr\{v \mid a\} = \Pr\{D_a \le -d\} + (.5)[1 - \Pr\{D_a \ge d\} - \Pr\{D_a \le -d\}]$.

A corresponding development for b gives

(3) $\Pr\{u \mid b\} = \Pr\{D_b \ge d\} + (.5)[1 - \Pr\{D_b \ge d\} - \Pr\{D_b \le -d\}]$

and

(4) $\Pr\{v \mid b\} = \Pr\{D_b \le -d\} + (.5)[1 - \Pr\{D_b \ge d\} - \Pr\{D_b \le -d\}]$.


Since $\Pr\{u \mid a\} + \Pr\{v \mid a\} = 1$ and $\Pr\{u \mid b\} + \Pr\{v \mid b\} = 1$, there are
available two independent equations in the four unknowns, $\Pr\{D_a \ge d\}$,
$\Pr\{D_a \le -d\}$, $\Pr\{D_b \ge d\}$, and $\Pr\{D_b \le -d\}$.

The compound stimulus (a, b) is defined as the joint presentation of a
and b, and this compound stimulus can be characterized in four mutually
exclusive and exhaustive ways by the responses obtained to a and b upon
separate presentation of these stimuli. That is, a and b may both be responded
to with u, a may be responded to with u and b with v, a may be responded
to with v and b with u, or both a and b may be responded to with v. Let the
corresponding designations of (a, b) be $(a_u, b_u)$, $(a_u, b_v)$, $(a_v, b_u)$, and $(a_v, b_v)$.

If the total probability of u to the presentation of (a, b) is denoted
$\Pr\{u \mid (a, b)\}$, then


(5) $\begin{aligned}
\Pr\{u \mid (a, b)\} &= \Pr\{u \mid (a_u, b_u)\}\cdot\Pr\{a_u, b_u\} + \Pr\{u \mid (a_u, b_v)\}\cdot\Pr\{a_u, b_v\} \\
&\quad + \Pr\{u \mid (a_v, b_u)\}\cdot\Pr\{a_v, b_u\} + \Pr\{u \mid (a_v, b_v)\}\cdot\Pr\{a_v, b_v\},
\end{aligned}$


where the entries on the right-hand side of the equation are the independent
contributions from the four classes of (a, b). By writing each of these terms
separately as a function of $\Pr\{D_a \ge d\}$, $\Pr\{D_a \le -d\}$, $\Pr\{D_b \ge d\}$, and
$\Pr\{D_b \le -d\}$, a total of six experimentally independent equations in the
four unknowns will be available so that the values of the unknowns are
overdetermined.

It follows from (1), (2), (3), and (4) that the probabilities of occurrence 
of the four classes of (a, b) are 
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Pr {a, ,.b.} = Pr {ula}-Pr {u|b} = Pr {D, > d}-Pr {D, > d} 
+ Pr {D, > d}-(.5)[1 — Pr {D, 2 d} — Pr {D, < —d}} 
(6) + (.5)[1 — Pr {D, > d} — Pr {D, < —d}]-Pr {D, > d} 
+ (.5)[1 — Pr {D, > d} — Pr{D. < —d}] 
.(.5)[1 — Pr {D, > d} — Pr {D, < —a}]; 
Pr {a, , b,} = Pr {ula}-Pr {v|b} = Pr {D, > d} 
-Pr {D, < —d} + Pr {D, > d}-\.5)[1 — Pr {D, > d} 


(7) — Pr {D, < —d}] + (.5)[1 — Pr {D. > @} 
— Pr {D. < —d}]-Pr {D, < —d} 
+ (.5)[1 — Pr {D, > d} — Pr {D. < —d}] 
*(.5)[1 — Pr {D, > d} — Pr {D, < —d}]; 
Pr {a, , bu} = Pr {vla}-Pr {ujb} = Pr {D, < —d}-Pr {D, > d} 
+ Pr {D. < —d}-(.5)[1 — Pr {D, > d} — Pr {D, < —d}] 
(8) + (.5)[1 — Pr {D, = d} — Pr {D. < —d}]-Pr {D, > d} 
+ (.5)[1 — Pr {D. > d} — Pr {D. < —d}] 
-(.5)[1 — Pr {D, > d} — Pr {D, < —d}]; 
and 
Pr {a, , b,} = Pr {v|a}-Pr {v|b} = Pr {D, < —d}-Pr {D, < —d} 
+ Pr {D, < —d}-(.5)[1 — Pr {D, > d} — Pr {D, < —d}] 
(9) + (.5)[1 — Pr {D. > d} — Pr {D. < —d}]-Pr {D, < —d} 
+ (.5)[1 — Pr {D, > d} — Pr {D. < —d}] 
(.5)[1 — Pr {D, > d} — Pr {D, < —d}]. 


The probability of u for each of the classes may be obtained by weighting
each component of $\Pr\{a_u, b_u\}$, $\Pr\{a_u, b_v\}$, $\Pr\{a_v, b_u\}$, and $\Pr\{a_v, b_v\}$
by an appropriate conditional probability of occurrence of u. The weights
assumed are as follows: the conditional probability of u is 1, given that the
combinations of response tendency states for a and b are $D_a \ge d$ and $D_b \ge d$,
or $D_a \ge d$ and $-d < D_b < d$, or $-d < D_a < d$ and $D_b \ge d$; the conditional
probability of u is 0 given $D_a \le -d$ and $D_b \le -d$, or $D_a \le -d$ and
$-d < D_b < d$, or $-d < D_a < d$ and $D_b \le -d$; and the conditional probability
of u is .5 given $D_a \ge d$ and $D_b \le -d$, or $D_a \le -d$ and $D_b \ge d$, or
$-d < D_a < d$ and $-d < D_b < d$. These assumed values are presented in Table 1.


TABLE 1

Assumed Conditional Probability of Occurrence of u to the
Compound Stimulus for the Possible Combinations of
Response Tendency States of a and b.


Response Tendency               Response Tendency States for b
States for a            $(D_b \ge d)$    $(-d < D_b < d)$    $(D_b \le -d)$

$(D_a \ge d)$                  1                1                 .5
$(-d < D_a < d)$               1               .5                  0
$(D_a \le -d)$                .5                0                  0


These weights in conjunction with (6), (7), (8), and (9) produce four
equations in the four unknowns $\Pr\{D_a \ge d\}$, $\Pr\{D_a \le -d\}$, $\Pr\{D_b \ge d\}$,
and $\Pr\{D_b \le -d\}$. Since the model under development was instigated by
the problem of the prediction of performance to a compound stimulus as a
function of the performance to the component stimuli, the relationships of
(1), (2), (3), and (4) may be used to reduce (6), (7), (8), and (9) to functions
of the two unknowns $\Pr\{D_a \ge d\}$ and $\Pr\{D_b \ge d\}$. The resulting, simplified
equations are

(10) $\begin{aligned}
\Pr\{u \mid (a_u, b_u)\}\cdot\Pr\{a_u, b_u\} = (.5)[&\Pr\{D_a \ge d\}\cdot\Pr\{u \mid b\} + \Pr\{D_b \ge d\}\cdot\Pr\{u \mid a\} \\
&- \Pr\{D_a \ge d\}\cdot\Pr\{D_b \ge d\} + \Pr\{u \mid a\}\cdot\Pr\{u \mid b\}];
\end{aligned}$

(11) $\begin{aligned}
\Pr\{u \mid (a_u, b_v)\}\cdot\Pr\{a_u, b_v\} = (.5)[&\Pr\{D_a \ge d\}\cdot\Pr\{v \mid b\} - \Pr\{D_b \ge d\}\cdot\Pr\{u \mid a\} \\
&+ \Pr\{u \mid a\}\cdot\Pr\{u \mid b\}];
\end{aligned}$

(12) $\begin{aligned}
\Pr\{u \mid (a_v, b_u)\}\cdot\Pr\{a_v, b_u\} = (.5)[&\Pr\{D_b \ge d\}\cdot\Pr\{v \mid a\} - \Pr\{D_a \ge d\}\cdot\Pr\{u \mid b\} \\
&+ \Pr\{u \mid a\}\cdot\Pr\{u \mid b\}];
\end{aligned}$

and

(13) $\begin{aligned}
\Pr\{u \mid (a_v, b_v)\}\cdot\Pr\{a_v, b_v\} = (.5)[&-\Pr\{D_a \ge d\}\cdot\Pr\{u \mid b\} - \Pr\{D_b \ge d\}\cdot\Pr\{u \mid a\} \\
&+ \Pr\{D_a \ge d\}\cdot\Pr\{D_b \ge d\} + \Pr\{u \mid a\}\cdot\Pr\{u \mid b\}].
\end{aligned}$

Finally, (5) becomes:

(14) $\begin{aligned}
\Pr\{u \mid (a, b)\} = {}&\Pr\{D_a \ge d\}[(.5) - \Pr\{u \mid b\}] \\
&+ \Pr\{D_b \ge d\}[(.5) - \Pr\{u \mid a\}] + 2\Pr\{u \mid a\}\cdot\Pr\{u \mid b\}.
\end{aligned}$
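
The reduction to (14) can be checked numerically by enumerating the nine state combinations with the Table 1 weights. The sketch below is illustrative only: the state probabilities are hypothetical values, and the names `class_contributions` and `equation_14` are assumptions, not the author's.

```python
from itertools import product

STATE_U = {"hi": 1.0, "mid": 0.5, "lo": 0.0}   # Pr{u | state} for one stimulus alone
TABLE_1 = {                                     # Pr{u | states of a and b} for the compound
    ("hi", "hi"): 1.0, ("hi", "mid"): 1.0, ("hi", "lo"): 0.5,
    ("mid", "hi"): 1.0, ("mid", "mid"): 0.5, ("mid", "lo"): 0.0,
    ("lo", "hi"): 0.5, ("lo", "mid"): 0.0, ("lo", "lo"): 0.0,
}


def class_contributions(p_a, q_a, p_b, q_b):
    """Pr{u|(a_x, b_y)} * Pr{a_x, b_y} for the four classes xy.

    p_* = Pr{D >= d}, q_* = Pr{D <= -d}; the middle state has probability 1 - p - q.
    """
    state_a = {"hi": p_a, "mid": 1.0 - p_a - q_a, "lo": q_a}
    state_b = {"hi": p_b, "mid": 1.0 - p_b - q_b, "lo": q_b}
    out = {"uu": 0.0, "uv": 0.0, "vu": 0.0, "vv": 0.0}
    for (sa, pr_sa), (sb, pr_sb) in product(state_a.items(), state_b.items()):
        for resp_a, resp_b in product("uv", repeat=2):
            pr_ra = STATE_U[sa] if resp_a == "u" else 1.0 - STATE_U[sa]
            pr_rb = STATE_U[sb] if resp_b == "u" else 1.0 - STATE_U[sb]
            out[resp_a + resp_b] += pr_sa * pr_ra * pr_sb * pr_rb * TABLE_1[(sa, sb)]
    return out


def equation_14(p_a, q_a, p_b, q_b):
    """Right-hand side of (14)."""
    u_a = p_a + 0.5 * (1.0 - p_a - q_a)
    u_b = p_b + 0.5 * (1.0 - p_b - q_b)
    return p_a * (0.5 - u_b) + p_b * (0.5 - u_a) + 2.0 * u_a * u_b


params = dict(p_a=0.5, q_a=0.1, p_b=0.3, q_b=0.2)   # hypothetical state probabilities
parts = class_contributions(**params)
total = sum(parts.values())
print(round(total, 4), round(equation_14(**params), 4))
```

For these hypothetical values the enumerated total and the closed form (14) coincide, and the four entries of `parts` correspond to the left-hand sides of (10) through (13).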




















It may also be noted from (10) and (13) that

$\begin{aligned}
2\Pr\{u \mid (a_u, b_u)\}\cdot\Pr\{a_u, b_u\} - \Pr\{u \mid a\}\cdot\Pr\{u \mid b\}
&= \Pr\{D_a \ge d\}\cdot\Pr\{u \mid b\} + \Pr\{D_b \ge d\}\cdot\Pr\{u \mid a\} - \Pr\{D_a \ge d\}\cdot\Pr\{D_b \ge d\} \\
&= \Pr\{u \mid a\}\cdot\Pr\{u \mid b\} - 2\Pr\{u \mid (a_v, b_v)\}\cdot\Pr\{a_v, b_v\},
\end{aligned}$

which indicates that it is necessary that

$\Pr\{u \mid a\}\cdot\Pr\{u \mid b\} - \Pr\{u \mid (a_u, b_u)\}\cdot\Pr\{a_u, b_u\} = \Pr\{u \mid (a_v, b_v)\}\cdot\Pr\{a_v, b_v\}$


if these two equations are to be consistent. The latter relationship provides
a partial test of the model since all four of the values are experimentally
independent observables. If this relationship can be shown to hold within
reasonable limits, then (10) and (13) may be combined into


(15) $\begin{aligned}
\Pr\{u \mid (a_u, b_u)\}\cdot\Pr\{a_u, b_u\} - \Pr\{u \mid (a_v, b_v)\}\cdot\Pr\{a_v, b_v\}
= {}&\Pr\{D_a \ge d\}\cdot\Pr\{u \mid b\} + \Pr\{D_b \ge d\}\cdot\Pr\{u \mid a\} \\
&- \Pr\{D_a \ge d\}\cdot\Pr\{D_b \ge d\}.
\end{aligned}$


A Test of the Model 


To obtain data for a test of the model, a learning task was carried out
in which subjects were required to associate the response Dac to each of
ten letter pairs and ten number pairs, and the response Jix to each of another
set of ten letter pairs and ten number pairs. In dealing with the resulting
data, it is convenient to define u as a correct response, C, and v as an incorrect
response, I. Also, stimulus a is defined as the set of twenty letter pairs, L,
and stimulus b as the set of twenty number pairs, N.

In constructing the letter pairs a total of ten letters (B, F, G, H, K, N, 
Q, S, Y, and Z) were used, and these were paired in such a fashion that each 
letter appeared in the first position twice, once to be associated with Dac 
and once with Jiz, and in the second position twice, again once to be associ- 
ated with Dac and once with Jiz. No two letters were ever paired more 
than once. The ten digits (0 through 9) were treated in similar manner. 

The subjects, 145 male volunteers from the introductory psychology 
class, participated in groups of approximately 14. To provide an opportunity 
for learning, the letter pairs and number pairs were shown individually in a 
random sequence on flash cards for 10 seconds, with the correct response 
exposed during the last 6 seconds. Subjects recorded no responses during 
the learning series but did record their responses during the test series, 
which was alternated with the learning series.

Three learning and three test series were given. In the test series each
of the 20 letter pairs, the 20 number pairs, and 20 compounds were presented 
individually in random order. Each compound was made up of a letter pair 
and a number pair chosen at random without replacement under the re- 
striction that the same response be correct for both the letter pair and the 
number pair. Each test series employed a different pairing of the letters and 
numbers in the compounds so that the same compound never appeared more 
than once. Exposure time for each stimulus in the test series was five seconds, 
during which time the response was written. 

Five experimental groups are distinguished on the basis of the relative
amount of training offered on letters and numbers. For Group I of 30
subjects, two exposures of each letter pair and each number pair were provided
in each learning series (L:N = 2:2); for Group II of 28 subjects (L:N = 2:1);
for Group III of 33 subjects (L:N = 1:2); for Group IV of 26 subjects
(L:N = 3:1) and for Group V of 28 subjects (L:N = 1:3).

The measure of the probability of correct response to the letters, the 
numbers, and the four classes of the compound is obtained for each of the 15 
test series by pooling over stimuli and subjects. Table 2 contains these data. 


TABLE 2


Obtained Probability of Correct Response for Letters Alone, Numbers Alone
and the Four Classes of the Letter-Number Compounds


Group                                         I             II            III            IV             V
                                         (L:N = 2:2)   (L:N = 2:1)   (L:N = 1:2)   (L:N = 3:1)   (L:N = 1:3)
Test                                      1   2   3     1   2   3     1   2   3     1   2   3     1   2   3

Pr{C|L}                                  .59 .67 .74   .57 .62 .62   .55 .63 .67   .66 .76 .80   .51 .59 .62
Pr{C|N}                                  .56 .73 .78   .53 .62 .64   .56 .71 .74   .56 .63 .70   .59 .68 .82

Pr{C|(L_C, N_C)}·Pr{L_C, N_C}            .25 .46 .57   .24 .33 .33   .22 .41 .48   .30 .44 .55   .22 .38 [?]
Pr{C|(L_C, N_I)}·Pr{L_C, N_I}            .13 .10 .10   .12 .12 .14   .11 .09 .07   .23 .20 .17   .11 .03 [?]
Pr{C|(L_I, N_C)}·Pr{L_I, N_C}            .13 .17 [?]   .13 [?] [?]   [?] [?] .17   .09 .07 [?]   .22 .23 [?]
Pr{C|(L_I, N_I)}·Pr{L_I, N_I}            .08 .02 .02   .11 .07 .06   .08 .04 .04   .04 .05 .02   .09 .04 [?]

Pr{C|L, N}                               .59 .75 .84   .60 .64 .67   .59 .74 .76   .66 .76 .83   .64 .68 [?]


A first test of the model comes from the relationship $\Pr\{C \mid L\}\cdot\Pr\{C \mid N\}
- \Pr\{C \mid (L_C, N_C)\}\cdot\Pr\{L_C, N_C\} = \Pr\{C \mid (L_I, N_I)\}\cdot\Pr\{L_I, N_I\}$ derived
from (10) and (13). The differences between the values for the left side and
the right side of the equations for the 15 observations have a mean of -.004,
a range of -.03 to .05, and a root mean square deviation around 0 of .02.
It would appear that the fit is sufficiently good so that (11), (12), and (15)
may be used in a further test of the model.


TABLE 3


Derived Values for $\Pr\{D_L \ge d\}$ and $\Pr\{D_N \ge d\}$ and Predicted Probability of
Correct Response for the Four Classes of the Letter-Number Compounds











Group I II XItT IV Vv 
(L:N =2:2) (L:N=2:1) (L:N=1:2) (L:N=3:1) (L:N=1:3) 
Test ge Or esata o> Gund Sa a eon eee eee, oS 
rf >. af 627.43 655] .25 «37.33.08 .30 .40].41 .58 .671.09 .19 .28 
Prf p> Bt 7 657) 165°) 27 637 63 [222 153 660) .13 ©6390 2584.13 62. .69 


Prf cite, | Pr{ to, vg 28 <hr .56 1.26 35 <36°1.23: .42 401.38 «8S .SS ) 216 139°. 99 
Pr fol(te, uy) rf to, nz} SRW 920, IP Gas 615° ctu .at 510) ..09)f..23 83 74) 20. 0a, 07 
Pref ci(ty, nig) pr fy, No o14° 18 .16].14 115 .15].18 .22 .20}.09 .09 .10].21 .26 .27 
Pr fop(ny, nt Petty, uf 205 .02 .10].04 .03 .05 1.08 .03 .02].05 .03 .Q1].24 .01 .02 


. 4 |.57 68.7 7 i79|.69 .80 .83].61 .71 .85 
Pr fou, un} <61 .78 .84 [57 .68 <71.1.60 .77 :79:}.69 .60 - 431.62 «72 .65 




















With three equations, the two unknowns are overdetermined, and since
there is no a priori reason for selecting any particular pair of equations for
solution, it was decided to obtain all three solutions and use the means as
the best estimates of $\Pr\{D_L \ge d\}$ and $\Pr\{D_N \ge d\}$. Accordingly, the
appropriate empirical values for each of the three tests for each of the five groups
were substituted into the equations and solutions for $\Pr\{D_L \ge d\}$ and
$\Pr\{D_N \ge d\}$ obtained. The resulting mean values are shown in Table 3.
To obtain an indication of the consistency of the three equations, the standard
deviation of the three estimates of $\Pr\{D_L \ge d\}$ and $\Pr\{D_N \ge d\}$ for each
of the 15 determinations was computed. These values ranged from .01 to
.19 and yielded a mean and median of .09 and .09, respectively, for
$\Pr\{D_L \ge d\}$. The range for $\Pr\{D_N \ge d\}$ was .01 to .19 with a mean and
median of .10 and .12, respectively.
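
Of the three pairings, the pair (11) and (12) is linear in the two unknowns and can be solved in closed form; the pairings that involve (15) contain the product of the unknowns and are not covered here. The sketch below handles only that linear pair, and the input values are hypothetical numbers generated from the model rather than entries of Table 2. The function name is an assumption.

```python
def solve_state_probabilities(u_a, u_b, x_uv, x_vu):
    """Solve (11) and (12) for Pr{D_a >= d} and Pr{D_b >= d}.

    u_a, u_b : Pr{u|a}, Pr{u|b}
    x_uv     : Pr{u|(a_u, b_v)} * Pr{a_u, b_v}
    x_vu     : Pr{u|(a_v, b_u)} * Pr{a_v, b_u}
    """
    v_a, v_b = 1.0 - u_a, 1.0 - u_b
    rhs_11 = 2.0 * x_uv - u_a * u_b      # equals p_a * v_b - p_b * u_a
    rhs_12 = 2.0 * x_vu - u_a * u_b      # equals p_b * v_a - p_a * u_b
    det = v_a * v_b - u_a * u_b          # simplifies to 1 - u_a - u_b
    if abs(det) < 1e-12:
        raise ValueError("(11) and (12) are dependent when u_a + u_b = 1")
    p_a = (v_a * rhs_11 + u_a * rhs_12) / det
    p_b = (u_b * rhs_11 + v_b * rhs_12) / det
    return p_a, p_b


# Hypothetical observables generated from p_a = .5, q_a = .1, p_b = .3, q_b = .2.
p_a, p_b = solve_state_probabilities(u_a=0.70, u_b=0.55, x_uv=0.20, x_vu=0.10)
print(round(p_a, 3), round(p_b, 3))
```

The determinant shows why estimates from this pair become unstable when the two component probabilities sum to nearly one, which may contribute to the spread among the three solutions.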

The predicted values for $\Pr\{C \mid (L_C, N_C)\}\cdot\Pr\{L_C, N_C\}$,
$\Pr\{C \mid (L_C, N_I)\}\cdot\Pr\{L_C, N_I\}$, $\Pr\{C \mid (L_I, N_C)\}\cdot\Pr\{L_I, N_C\}$,
$\Pr\{C \mid (L_I, N_I)\}\cdot\Pr\{L_I, N_I\}$, and $\Pr\{C \mid L, N\}$ using mean $\Pr\{D_L \ge d\}$
and mean $\Pr\{D_N \ge d\}$ are contained in Table 3. A comparison of the obtained
values of Table 2 with the predicted values of Table 3 shows a quite satisfactory
fit except for a small consistent tendency for the predicted values of
$\Pr\{C \mid (L_C, N_C)\}\cdot\Pr\{L_C, N_C\}$, $\Pr\{C \mid (L_C, N_I)\}\cdot\Pr\{L_C, N_I\}$, and
$\Pr\{C \mid (L_I, N_C)\}\cdot\Pr\{L_I, N_C\}$ to be too high and the predicted values of
$\Pr\{C \mid (L_I, N_I)\}\cdot\Pr\{L_I, N_I\}$ to be too low. This discrepancy is reflected in
mean differences of .007, .012, .013, and -.013 between predicted and obtained
values for these classes of the compound stimuli. The root mean square
deviations of the differences around 0 for the same four classes of the compound
stimuli and for the total, $\Pr\{C \mid L, N\}$, yielded values of .021, .016, .018, .028,
and .031, indicating further the adequacy of the fit of the model to these data.


PROCEDURES FOR OBTAINING SEPARATE SET AND
CONTENT COMPONENTS OF A TEST SCORE*


GERALD C. HELMSTADTER


COLORADO STATE UNIVERSITY†


Using two distinct models, several formulas for obtaining separate set 
and content components of a test score have been derived. Comparisons among 
the methods are made algebraically and through their application to a set of 
test data apparently affected by response sets. 


Cronbach [1] has pointed out that when a test is composed of difficult 
items having but two or three alternatives, a response set is likely to affect 
the total test score. By response set is meant "any tendency causing a person
consistently to give different responses to test items than he would when
the same content is presented in different form." Cronbach has further
shown [2] that such an effect shows test-retest stability, and he feels that there
is adequate evidence to conclude that various response sets reflect "real"
dimensions of human differences. As he further points out, some of the 
response-set variance is potentially useful while some of it will interfere 
with measurement. To be able to capitalize on the effect of response set when 
it is useful and to eliminate it when it is undesirable, some procedure for 
obtaining separate set and content components of a test score is necessary. 
This paper presents a logical basis and compares a number of scoring pro- 
cedures for separating the response set from the score reflecting individual 
differences with respect to the obvious item content. Because response sets 
most commonly occur in tests composed of items which have but two
alternatives, the following discussion is restricted to this case. Also, for
convenience, it will be assumed that no items are omitted.

To simplify the discussion a single notation will be used throughout.
For convenience, the symbols are listed together below. Further explanation
of the terms will be given as each is used.


*The author wishes to thank Dr. Norman Frederiksen, who brought this problem 
to his attention, and Drs. Ledyard R Tucker and Frederic M. Lord, who suggested three 
of the procedures. All have contributed materially to this paper through many helpful 
suggestions and criticisms. 

†This paper was written while the author was an Associate in Research at the
Educational Testing Service, Princeton, New Jersey.

$N_\alpha$ and $N_\beta$ = number of items keyed A and B, respectively.*
$K_\alpha$ and $K_\beta$ = number of items keyed A and B, respectively, which have been
    answered correctly on the basis of content.
$P_A$ and $P_B$ = the examinee’s probability of marking an item not answered on the
    basis of content A and B, respectively.
$R_A$ = number of items keyed A and marked A by the examinee.
$R_B$ = number of items keyed B and marked B by the examinee.
$W_A$ = number of items keyed B but marked A by the examinee.
$W_B$ = number of items keyed A but marked B by the examinee.
$A_{\alpha a_i}$, $A_{\alpha b_i}$, $A_{\beta a_i}$, $A_{\beta b_i}$ = proportion of examinees who marked A for an
    item classified by the subscript as follows: $\alpha$ = keyed A; $\beta$ = keyed B;
    $a_i$ = marked A by person $i$; $b_i$ = marked B by person $i$.
$C_j$ = content score by procedure $j$.
$S_j$ = set score by procedure $j$. ($S_j$ will always mean a set to mark response A.)


The Scoring Procedures 


Usual Total Score Procedure 


Ordinarily, a test is scored by simply counting the number of items in 
agreement with a key. Thus, in the present terminology, the usual scoring 
procedure can be expressed as 


(1)    $C_1 = R_A + R_B.$


In such a procedure, no attempt is made to distinguish between the
relative effects of content and set. However, a straightforward (though
not entirely satisfactory) way of obtaining an estimate of set would be
to take the difference between the number of items the examinee marked
A and the number he marked B. Thus, let

(2)    $S_1 = R_A + W_A - (R_B + W_B).$
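
As a minimal sketch of equations (1) and (2), the four observed counts can be tallied from a key and one examinee's responses. The function names and the six-item key below are illustrative assumptions, not part of the original paper.

```python
def observed_counts(key, responses):
    """Tally one examinee's answers into (R_A, R_B, W_A, W_B).

    key and responses are equal-length sequences of "A"/"B" (no omits).
    """
    if len(key) != len(responses):
        raise ValueError("key and responses must have the same length")
    pairs = list(zip(key, responses))
    r_a = sum(1 for k, m in pairs if k == "A" and m == "A")  # keyed A, marked A
    r_b = sum(1 for k, m in pairs if k == "B" and m == "B")  # keyed B, marked B
    w_a = sum(1 for k, m in pairs if k == "B" and m == "A")  # keyed B, marked A
    w_b = sum(1 for k, m in pairs if k == "A" and m == "B")  # keyed A, marked B
    return r_a, r_b, w_a, w_b


def total_score(r_a, r_b):
    """Equation (1): number of items marked in agreement with the key."""
    return r_a + r_b


def simple_difference(r_a, r_b, w_a, w_b):
    """Equation (2): items marked A minus items marked B."""
    return (r_a + w_a) - (r_b + w_b)


# Hypothetical six-item test: four items keyed A, two keyed B.
key = "AAAABB"
responses = "AAABAB"
r_a, r_b, w_a, w_b = observed_counts(key, responses)
c1 = total_score(r_a, r_b)
s1 = simple_difference(r_a, r_b, w_a, w_b)
print(c1, s1)  # prints "4 2"
```

The later procedures all start from the same four counts, so a helper of this kind could be reused throughout.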


Warranted Set Procedure 


One possible solution to the problem of eliminating the effects of a 
response set is to give the examinee a score based upon the extent to which 
his tendency to mark response A rather than response B is warranted. 

Consider the relationship between the keyed response and the examinee’s 
response illustrated in Figure 1. One could think of the elements of this 
matrix as follows: 

$R_A$ and $R_B$ = examinee’s warranted set to mark A and to mark B,
respectively;

$W_A$ and $W_B$ = examinee’s unwarranted set to mark A and to mark B,
respectively;


$R_A + W_A$ and $R_B + W_B$ = examinee’s total set to mark A and to mark B,
respectively.


                              Examinee's Response
                         A              B              Total
Keyed       $\alpha$     $R_A$          $W_B$          $N_\alpha$
Response    $\beta$      $W_A$          $R_B$          $N_\beta$
            Total        $R_A + W_A$    $R_B + W_B$    $N_\alpha + N_\beta$

Fig. 1. Observed Score Matrix


*Throughout this paper it will be helpful to remember that the Greek subscripts
indicate the way an item was keyed, while the English subscripts indicate the way an
item was marked by the individual or by the group of examinees.


Then, if any indeterminate ratios are put equal to zero, a content score
could be defined as

(3)    $C_2 = \dfrac{\text{warranted set to mark A}}{\text{total set to mark A}} + \dfrac{\text{warranted set to mark B}}{\text{total set to mark B}} = \dfrac{R_A}{R_A + W_A} + \dfrac{R_B}{R_B + W_B}.$


Similarly, it would be possible to define a set score as

(4)    $S_2 = \dfrac{\text{unwarranted set to mark A}}{\text{total set to mark A}} - \dfrac{\text{unwarranted set to mark B}}{\text{total set to mark B}} = \dfrac{W_A}{R_A + W_A} - \dfrac{W_B}{R_B + W_B}.$
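
The warranted set scores of equations (3) and (4) can be sketched as follows, with indeterminate ratios (an examinee who never marks A, or never marks B) put equal to zero as the text specifies. The function names and example counts are hypothetical.

```python
def ratio_or_zero(numerator, denominator):
    """Return numerator/denominator, or 0.0 when the ratio is indeterminate."""
    return numerator / denominator if denominator else 0.0


def warranted_set_scores(r_a, r_b, w_a, w_b):
    """Return (C_2, S_2) from the four observed counts."""
    total_a = r_a + w_a  # total set to mark A
    total_b = r_b + w_b  # total set to mark B
    c2 = ratio_or_zero(r_a, total_a) + ratio_or_zero(r_b, total_b)
    s2 = ratio_or_zero(w_a, total_a) - ratio_or_zero(w_b, total_b)
    return c2, s2


# Hypothetical counts: R_A = 3, R_B = 1, W_A = 1, W_B = 1.
c2, s2 = warranted_set_scores(3, 1, 1, 1)

# An examinee who marks every item B: the "mark A" ratios are indeterminate.
c2_all_b, s2_all_b = warranted_set_scores(0, 2, 0, 4)
```

Under this reading, a negative $S_2$ indicates an unwarranted tendency toward B rather than toward A.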


Postulated Knowledge Procedure 


A third way which might be used to score a test which is subject to
the effects of a response set involves an estimation of the number of responses
actually based on content. Following a logic somewhat similar to that used
in deriving scores which “correct for guessing,” the number of items observed
as marked in agreement with the key (i.e., $R_A$ and $R_B$) can be thought of as
resulting from a summation of those items marked in a given direction on
the basis of content and of those items marked in this same direction on
the basis of set. If $K_\alpha$ and $K_\beta$ represent the number of $\alpha$-items and $\beta$-items
marked A and B, respectively, on the basis of content, then $N_\alpha - K_\alpha$ and
$N_\beta - K_\beta$ represent the number of $\alpha$-items and $\beta$-items, respectively, which
have been answered on some other basis. Then, if $P_A$ and $P_B$ represent the
probability that an individual marks an item A or B, respectively, on a
basis other than content, the number of $\alpha$-items and the number of $\beta$-items
marked in agreement with the key on a basis other than content will be
$P_A(N_\alpha - K_\alpha)$ and $P_B(N_\beta - K_\beta)$, respectively. Thus, it can be said that


(5)    $R_A = K_\alpha + P_A(N_\alpha - K_\alpha)$
and
(6)    $R_B = K_\beta + P_B(N_\beta - K_\beta).$


If no omits are permitted, the examinee must mark either A or B and thus

(7)    $P_A + P_B = 1.$


A final equation necessary to obtain values for $K_\alpha$, $K_\beta$, $P_A$, and $P_B$ can
be obtained by assuming that the $\alpha$-items are equal in difficulty to the $\beta$-items.
This assumption can be expressed as

(8)    $\dfrac{K_\alpha}{N_\alpha} = \dfrac{K_\beta}{N_\beta}.$


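The simultaneous solution of these last four equations yields

(9)    $K_\alpha = R_B\left(\dfrac{N_\alpha}{N_\beta}\right) - W_B,$

(10)   $K_\beta = R_A\left(\dfrac{N_\beta}{N_\alpha}\right) - W_A,$

(11)   $P_B = \dfrac{N_\beta W_B}{N_\beta W_B + N_\alpha W_A},$

(12)   $P_A = \dfrac{N_\alpha W_A}{N_\beta W_B + N_\alpha W_A}.$
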
Given these values, the obvious content score is

(13)   $C_3 = K_\alpha + K_\beta$

(14)   $\quad\; = R_A\left(\dfrac{N_\beta}{N_\alpha}\right) + R_B\left(\dfrac{N_\alpha}{N_\beta}\right) - (W_A + W_B).$
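
The closed-form solution can be checked numerically. The sketch below computes $K_\alpha$, $K_\beta$, $P_A$, and $P_B$ from the observed counts and lets the caller confirm that they satisfy equations (5) through (8). The counts used ($R_A = 36$, $R_B = 15$ on a test with 45 $\alpha$-items and 25 $\beta$-items) are hypothetical, and raising an error for a perfect score is an assumption of this sketch, since the paper does not say how that indeterminate case should be handled.

```python
def postulated_knowledge(r_a, r_b, n_alpha, n_beta):
    """Return (K_alpha, K_beta, P_A, P_B) by the postulated knowledge procedure."""
    w_a = n_beta - r_b   # keyed B but marked A
    w_b = n_alpha - r_a  # keyed A but marked B
    k_alpha = r_b * n_alpha / n_beta - w_b
    k_beta = r_a * n_beta / n_alpha - w_a
    denom = n_alpha * w_a + n_beta * w_b
    if denom == 0:
        # Assumption: with no wrong answers there is nothing to estimate set from.
        raise ValueError("P_A and P_B are indeterminate for a perfect score")
    p_a = n_alpha * w_a / denom
    p_b = n_beta * w_b / denom
    return k_alpha, k_beta, p_a, p_b


n_alpha, n_beta = 45, 25
r_a, r_b = 36, 15
k_alpha, k_beta, p_a, p_b = postulated_knowledge(r_a, r_b, n_alpha, n_beta)
c3 = k_alpha + k_beta  # content score, equation (13)

# Residuals of equations (5) to (8); each should be zero up to rounding.
residuals = [
    r_a - (k_alpha + p_a * (n_alpha - k_alpha)),
    r_b - (k_beta + p_b * (n_beta - k_beta)),
    p_a + p_b - 1,
    k_alpha / n_alpha - k_beta / n_beta,
]
```

For these counts the solution is $K_\alpha = 18$, $K_\beta = 10$, $P_A = 2/3$, and $P_B = 1/3$, so that $C_3 = 28$.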

While $P_A$ itself could serve as a measure of set toward A, it is more
convenient to let

(15)   $S_3 = 2P_A - 1.$

Here $S_3$ ranges from −1 to +1 and is 0 when the examinee is just as likely
to mark an unknown item A as he is to mark it B. Thus,

(16)   $S_3 = \dfrac{2N_\alpha W_A}{N_\alpha W_A + N_\beta W_B} - 1.$


Orthogonal Score Procedure


Another possibility is to consider the set and content scores as orthogonal
traits. In this conception of the problem, each examinee is represented by a
point in a plot of the proportion of items keyed A which were marked A
(i.e., $R_A/N_\alpha$) against the proportion of items keyed B which were marked A.
If this is done, the vector going from (0, 0) to (1, 1) can be considered a set
axis and the vector going from (0, 1) to (1, 0) a content axis. The set and
content scores of the examinee can then be defined as some function of the
projection of his plotted point on these respective axes, e.g., that given in
Figure 2.


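[Figure: horizontal axis, $\alpha$'s marked A ($R_A/N_\alpha$), from 0 to 1; vertical axis, $\beta$'s marked A ($W_A/N_\beta$), from 0 to 1; the diagonal is labeled “Set to A.”]

Fig. 2. Set and Content as Orthogonal Traits.
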
The scores can readily be obtained in terms of the observed measures
by applying a 45° rotation and making an appropriate translation of the axes.
For convenience, translations which make all content scores positive and
which make set scores equal to zero when at a chance level (i.e., when
$P_A = P_B = \tfrac{1}{2}$) have been used. Thus, by this procedure:

(17)   $C_4 = .707\left(\dfrac{R_A}{N_\alpha} - \dfrac{W_A}{N_\beta} + 1\right);$

(18)   $S_4 = .707\left(\dfrac{R_A}{N_\alpha} + \dfrac{W_A}{N_\beta} - 1\right).$

Postulated Scale Score 


All of the procedures thus far discussed have assumed that the items 
in the test were dichotomous with respect to content. That is, the items 
have been considered to represent either A or not A. The solutions to the 
problem of obtaining separate set and content scores presented in this and 
the following sections make a different assumption: that the extent to which 
items represent A can be expressed as a continuous variable. Thus, the test 
items are visualized as falling along a unidimensional scale characterized 
by the content. For example, in an inventory designed to detect an authori- 
tarian personality, it might be preferable to think of statements (with which 
a respondent is asked to indicate his agreement or disagreement) as repre- 
senting various degrees of authoritarianism rather than as being classified 
as authoritarian and nonauthoritarian. 

One possibility under this view is to postulate, for each individual, a 
characteristic curve relating the probability of his marking an item of a 
given scale value as A to the scale value of the item. Two such curves are 
illustrated in Figure 3.

The content score of the individual would be his ability to distinguish
between items having different scale values and thus would be represented
by the slope of the curve. On the other hand, the set score measures his
tendency to call all items A or all items B and therefore would be represented
by some index of central tendency which would locate the position of the
curve on the scale. The scale value corresponding to probability 1/2 of
marking an item A could be used for this purpose.

[Figure: vertical axis, probability of marking an item A (0 to 1.0); horizontal axis, tendency of item to represent $\alpha$.]

Fig. 3. Item Characteristic Curve for Two Individuals.

The problem, then, is to determine the important parameters of this 
characteristic curve from observations of one or zero (i.e., calling an item A 
or not calling an item A) for each person for each item. While theoretically it 
would be possible to obtain the maximum likelihood estimators of the desired 
parameters, preliminary work, specifying first the normal ogive and then a 
straight line as the form of the characteristic curve, indicated that the solu- 
tions are far too complex to be of practical value. 

One approximation which is feasible, however, is the following. Assume 
that all the items fall at only two points which differ along the scale charac- 
terized by A. Those items keyed A could be considered as estimates of one 
point, and those keyed B of the second. Then the scale value of each of these 
points could be obtained by averaging, within each group of items, the normal 
deviates corresponding to the proportion of persons marking the item A, 
that is, by taking

(19)   $\bar{Z}_\alpha = \dfrac{1}{N_\alpha}\left[\sum_{a_i} Z_{\alpha a_i} + \sum_{b_i} Z_{\alpha b_i}\right]$
and
(20)   $\bar{Z}_\beta = \dfrac{1}{N_\beta}\left[\sum_{a_i} Z_{\beta a_i} + \sum_{b_i} Z_{\beta b_i}\right],$

where Z is the normal deviate corresponding to the indicated proportions.
Then, the normal deviates for the proportion of A items and the proportion
of B items which each individual marked A can be plotted at these points
as indicated in Figure 4. The slope of the line determined by these two points
can now be used as a content score, and the height of the line at its midpoint
can be used as a set score. Thus,

(21)   $C_5 = \dfrac{Z_{R_A/N_\alpha} - Z_{W_A/N_\beta}}{\bar{Z}_\alpha - \bar{Z}_\beta}$
and
(22)   $S_5 = \tfrac{1}{2}\left(Z_{R_A/N_\alpha} + Z_{W_A/N_\beta}\right).$
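
The postulated scale scores can be sketched with the standard library's `statistics.NormalDist`, whose `inv_cdf` method returns the normal deviate for a given proportion. The item proportions and counts below are hypothetical, and the sketch makes no provision for proportions of exactly 0 or 1, for which the normal deviate is undefined (`inv_cdf` raises `StatisticsError`).

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # unit normal distribution


def normal_deviate(p):
    """Normal deviate Z corresponding to proportion p (0 < p < 1)."""
    return _STD_NORMAL.inv_cdf(p)


def group_scale_value(item_proportions):
    """Equations (19)/(20): mean normal deviate over one group of items."""
    deviates = [normal_deviate(p) for p in item_proportions]
    return sum(deviates) / len(deviates)


def postulated_scale_scores(r_a, w_a, n_alpha, n_beta, zbar_alpha, zbar_beta):
    """Equations (21)/(22): slope and midpoint height of the two-point line."""
    z_alpha = normal_deviate(r_a / n_alpha)
    z_beta = normal_deviate(w_a / n_beta)
    c5 = (z_alpha - z_beta) / (zbar_alpha - zbar_beta)
    s5 = 0.5 * (z_alpha + z_beta)
    return c5, s5


# Hypothetical proportions of a standard group marking each item A.
zbar_alpha = group_scale_value([0.7, 0.8, 0.6])  # items keyed A
zbar_beta = group_scale_value([0.3, 0.4, 0.2])   # items keyed B

c5, s5 = postulated_scale_scores(36, 10, 45, 25, zbar_alpha, zbar_beta)
c5_chance, s5_chance = postulated_scale_scores(5, 5, 10, 10, zbar_alpha, zbar_beta)
```

An examinee who marks A on half of each item group obtains zero on both scores, which matches the chance-level convention used for the orthogonal scores.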


Correlation Procedure 


An alternative procedure for obtaining a content score, assuming the
items can be scaled, is to compute the biserial correlation between the
examinees’ dichotomized responses and scale values of the items along a content
continuum. Both Tucker [6] and Lord [4] have indicated that this biserial
correlation is a simple function of the slope of the characteristic curve
whenever the scale values of the items have a normal distribution for the
particular test under consideration. Thus, the correlation procedure provides another
means of estimating the slope of the characteristic curve shown in Figure 3.
Also, it seems quite reasonable to consider an examinee who can successfully
rank all the items in the test according to their relative position along the
content continuum as having the ability to make very good discriminations
with respect to the content, regardless of where he would locate the items
as a group along the axis.

[Figure: vertical axis, normal deviate for proportion of items an individual marked A; horizontal axis, tendency (in normal deviates) of item groups to represent $\alpha$, with the two item groups located at $\bar{Z}_\beta$ and $\bar{Z}_\alpha$.]

Fig. 4. An Approximation to the Item Characteristic Curve.

The formula for biserial correlation is

(23)   $r_b = \dfrac{(M_p - M_q)\,pq}{Z\sigma_t}.$

If the proportion of examinees (in a standard group) that marked the item A
is used as the scale value of the item, the elements for the biserial correlation
formula can be expressed in terms of the notation used here as follows:

$p = \dfrac{R_A + W_A}{N_\alpha + N_\beta}; \qquad q = \dfrac{R_B + W_B}{N_\alpha + N_\beta};$

$M_p = \dfrac{\sum A_{\alpha a_i} + \sum A_{\beta a_i}}{R_A + W_A}; \qquad M_q = \dfrac{\sum A_{\alpha b_i} + \sum A_{\beta b_i}}{R_B + W_B};$

$\sigma_t = \sigma_A$; $r_b = C_6$; Z = normal deviate corresponding to p.

There seems to be no direct suggestion for a set score by a correlational
procedure.
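
The correlation procedure can be sketched as below. One point is an assumption of this sketch rather than something stated in the text: $Z$ is taken to be the ordinate of the unit normal curve at the point dividing the proportions $p$ and $q$, which is the usual convention for the biserial coefficient. The scale values and responses are hypothetical, and $\sigma_A$ is computed as the population standard deviation of the scale values.

```python
import math
from statistics import NormalDist, mean, pstdev


def biserial_content_score(scale_values, marked_a):
    """Biserial correlation between item scale values and one examinee's marks.

    scale_values: proportion of a standard group marking each item A.
    marked_a: True where this examinee marked the item A, False where B.
    """
    if len(scale_values) != len(marked_a):
        raise ValueError("one mark is required per item")
    in_a = [v for v, m in zip(scale_values, marked_a) if m]
    in_b = [v for v, m in zip(scale_values, marked_a) if not m]
    if not in_a or not in_b:
        raise ValueError("examinee must mark at least one item each way")
    p = len(in_a) / len(scale_values)
    q = 1.0 - p
    unit_normal = NormalDist()
    # Assumption: Z is the normal ordinate at the p/q split.
    ordinate = unit_normal.pdf(unit_normal.inv_cdf(p))
    return (mean(in_a) - mean(in_b)) * p * q / (ordinate * pstdev(scale_values))


scale = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
marks = [True, True, False, True, False, False]
c6 = biserial_content_score(scale, marks)
c6_reversed = biserial_content_score(scale, [not m for m in marks])
```

Reversing every mark reverses the sign of the coefficient, so an examinee who orders the items exactly backwards would receive a strongly negative content score.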





Comparison of the Methods


When several approaches are proposed for the solution of a single problem, 
it is imperative that some attempt be made to determine the extent to 
which the various solutions produce similar results. Thus, all of the formulas
have been compared, and those which were neither algebraically identical nor
linearly equivalent were used to obtain separate set and content scores for 62
individuals who had been given an experimental test designed to measure 
one aspect of report-writing ability. 

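First consider the content scores, writing them solely in terms of the
readily obtained quantities $R_A$, $R_B$, $N_\alpha$, and $N_\beta$.

(24)   $C_1 = R_A + R_B;$

(25)   $C_2 = \dfrac{R_A}{N_\beta + (R_A - R_B)} + \dfrac{R_B}{N_\alpha - (R_A - R_B)};$

(26)   $C_3 = R_A(N_\beta/N_\alpha) + R_B(N_\alpha/N_\beta) - (N_\beta - R_B + N_\alpha - R_A)$
(27)   $\quad\; = (N_\alpha + N_\beta)[(R_A/N_\alpha) + (R_B/N_\beta) - 1];$

(28)   $C_4 = .707\{(R_A/N_\alpha) - [(N_\beta - R_B)/N_\beta] + 1\}$
(29)   $\quad\; = .707[(R_A/N_\alpha) + (R_B/N_\beta)]$
(30)   $\quad\; = \dfrac{.707}{N_\alpha + N_\beta}\,C_3 + .707;$

(31)   $C_5 = \dfrac{Z_{R_A/N_\alpha} - Z_{(N_\beta - R_B)/N_\beta}}{\bar{Z}_\alpha - \bar{Z}_\beta}.$

If $\bar{Z}_\alpha - \bar{Z}_\beta$ be the unit of the scale, noting that $Z_{1-p} = -Z_p$,

(32)   $C_5 = Z_{R_A/N_\alpha} + Z_{R_B/N_\beta}.$

(33)   $C_6 = \dfrac{[N_\alpha - (R_A - R_B)]\left[\sum A_{\alpha a_i} + \sum A_{\beta a_i}\right] - [N_\beta + (R_A - R_B)]\left[\sum A_{\alpha b_i} + \sum A_{\beta b_i}\right]}{Z\sigma_A(N_\alpha + N_\beta)^2}$

(34)   $\quad\; = \dfrac{q\left(\sum A_{\alpha a_i} + \sum A_{\beta a_i}\right) - p\left(\sum A_{\alpha b_i} + \sum A_{\beta b_i}\right)}{Z\sigma_A(N_\alpha + N_\beta)},$

where

$q = \dfrac{N_\alpha - (R_A - R_B)}{N_\alpha + N_\beta}$ and $p = \dfrac{N_\beta + (R_A - R_B)}{N_\alpha + N_\beta}.$

Expressed in these terms it is readily apparent that $C_3$ and $C_4$ will place
examinees in the same order and are linear functions of $(R_A/N_\alpha) + (R_B/N_\beta)$
which, for convenience, will be called the simplified ratio score and be
designated by $C_7$. Also, it is interesting to note that when $N_\alpha = N_\beta$, $C_7$ is
perfectly correlated with the number correct.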
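
The linear equivalence claimed here can be confirmed by brute force. The sketch below evaluates $C_3$, $C_4$, and the simplified ratio score for every possible pair of counts on a test with 45 $\alpha$-items and 25 $\beta$-items (the item counts of the illustrative test described later in the paper) and records the largest departure from the two linear relations. The function names are illustrative.

```python
import math

N_ALPHA, N_BETA = 45, 25


def c3(r_a, r_b):
    """Postulated knowledge content score, equation (26)."""
    return (r_a * N_BETA / N_ALPHA + r_b * N_ALPHA / N_BETA
            - ((N_BETA - r_b) + (N_ALPHA - r_a)))


def c4(r_a, r_b):
    """Orthogonal content score, equation (28), with .707 taken as sqrt(1/2)."""
    return math.sqrt(0.5) * (r_a / N_ALPHA - (N_BETA - r_b) / N_BETA + 1)


def c7(r_a, r_b):
    """Simplified ratio score."""
    return r_a / N_ALPHA + r_b / N_BETA


max_gap = 0.0
for r_a in range(N_ALPHA + 1):
    for r_b in range(N_BETA + 1):
        ratio = c7(r_a, r_b)
        gap3 = abs(c3(r_a, r_b) - (N_ALPHA + N_BETA) * (ratio - 1))  # eq. (27)
        gap4 = abs(c4(r_a, r_b) - math.sqrt(0.5) * ratio)            # eq. (29)
        max_gap = max(max_gap, gap3, gap4)
```

`max_gap` stays at rounding-error level, so both scores are increasing linear functions of the simplified ratio score and must rank examinees identically.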

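Next, consider the set scores, again writing each in terms of the quantities
$R_A$, $R_B$, $N_\alpha$, and $N_\beta$:

(35)   $S_1 = R_A + N_\beta - R_B - R_B - N_\alpha + R_A$
(36)   $\quad\; = N_\beta - N_\alpha + 2(R_A - R_B);$

(37)   $S_2 = \dfrac{N_\beta - R_B}{R_A + N_\beta - R_B} - \dfrac{N_\alpha - R_A}{R_B + N_\alpha - R_A}$
(38)   $\quad\; = \dfrac{R_B}{N_\alpha - (R_A - R_B)} - \dfrac{R_A}{N_\beta + (R_A - R_B)};$

(39)   $S_3 = \dfrac{2N_\alpha(N_\beta - R_B)}{N_\alpha(N_\beta - R_B) + N_\beta(N_\alpha - R_A)} - 1$
(40)   $\quad\; = \dfrac{(R_A/N_\alpha) - (R_B/N_\beta)}{2 - (R_A/N_\alpha) - (R_B/N_\beta)};$

(41)   $S_4 = .707\{[(R_A/N_\alpha) + (N_\beta - R_B)/N_\beta] - 1\}$
(42)   $\quad\; = .707[(R_A/N_\alpha) - (R_B/N_\beta)];$

(43)   $S_5 = .5\left(Z_{R_A/N_\alpha} + Z_{(N_\beta - R_B)/N_\beta}\right)$
(44)   $\quad\; = .5\left(Z_{R_A/N_\alpha} - Z_{R_B/N_\beta}\right).$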

In this instance, no two procedures will produce identical results insofar
as the ranking of the examinees is concerned. Since, however, $S_4$ will rank
individuals the same as $(R_A/N_\alpha) - (R_B/N_\beta)$, this procedure will hereafter
be designated $S_7$ to indicate that it is the companion of the simplified ratio
score, $C_7$.
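
That the set scores need not rank examinees alike can be seen in a small hypothetical case. The sketch compares the postulated knowledge set score with the simplified ratio set score for two examinees on a test with ten items keyed each way; the counts are invented for illustration.

```python
def s3(r_a, r_b, n_alpha, n_beta):
    """Postulated knowledge set score written in terms of R_A and R_B."""
    w_a = n_beta - r_b
    w_b = n_alpha - r_a
    return 2 * n_alpha * w_a / (n_alpha * w_a + n_beta * w_b) - 1


def s7(r_a, r_b, n_alpha, n_beta):
    """Simplified ratio set score: (R_A / N_alpha) - (R_B / N_beta)."""
    return r_a / n_alpha - r_b / n_beta


n_alpha = n_beta = 10
first = (9, 8)   # (R_A, R_B): high content, slight lean toward A
second = (6, 4)  # (R_A, R_B): chance-level content, larger lean toward A

s3_first, s3_second = s3(*first, n_alpha, n_beta), s3(*second, n_alpha, n_beta)
s7_first, s7_second = s7(*first, n_alpha, n_beta), s7(*second, n_alpha, n_beta)
```

Here the simplified ratio score places the second examinee higher on set (about .2 against .1), whereas the postulated knowledge score places the first examinee higher (about .33 against .20), because it expresses the same difference relative to the few items the first examinee missed.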

The examination used in the empirical comparison of the methods was 
designed to measure a person’s ability to recognize when one expression could 
be substituted for another without altering the meaning of the statement. 
Each item consisted of a short statement containing a word or expression 
which had been underlined and an alternative expression in parentheses at
the end of the sentence. The task was to indicate whether or not the alter- 
native expression could be substituted for the original one without influencing 
the possible consequences should some policy decision, administrative action, 
or legal claim hinge upon the interpretation of the statement. In scoring the 
test, 70 such statements were used. Forty-five items were keyed same (i.e.,
the consequences would be the same no matter which alternative expression
was used) and 25 were keyed different.

In an experimental tryout, the test was administered to 62 students in a 
graduate school of journalism. When the results were analyzed, it was noted 
that while the corrected odd-even reliability of the total score was only .47, 
similar reliabilities, obtained by scoring separately those items keyed “same”
and those keyed “different,” were .79 and .72, respectively. This fact,
emphasized by the resulting correlation of −.54 between scores obtained on the
“same” items and those obtained on the “different” items, led to the
conclusion that a response set was affecting the results.
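
A corrected odd-even reliability of the kind reported above can be sketched as follows. The sketch assumes that "corrected" refers to the Spearman-Brown step-up of the correlation between odd-item and even-item half scores, which is the customary meaning but is not spelled out in the text; the three-examinee data set is hypothetical.

```python
import math


def pearson(xs, ys):
    """Product-moment correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def spearman_brown(r_half):
    """Step a half-test correlation up to full test length."""
    return 2 * r_half / (1 + r_half)


def corrected_odd_even_reliability(item_scores):
    """item_scores: one row per examinee, one 0/1 entry per item."""
    odd = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    return spearman_brown(pearson(odd, even))


# Hypothetical data in which the two halves agree perfectly.
rows = [[1, 1, 1, 1], [1, 1, 0, 0], [0, 0, 0, 0]]
reliability = corrected_odd_even_reliability(rows)
```

Applied separately to the total key and to the "same" and "different" item groups, a routine of this kind would yield the three coefficients being compared.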

Because these results suggested a real difference among individuals on 
a variable other than that measured by the total score, it was felt that the 
data would be appropriate for a comparison of the various procedures
suggested for obtaining separate content and set components of a test score.
Consequently $C_1$ and $S_1$ (total score and simple difference), $C_2$ and $S_2$ (content
and set by the warranted set procedure), $S_3$ (set by the postulated knowledge
procedure), $C_5$ and $S_5$ (content and set by the postulated scale score procedure),
$C_6$ (content by correlation procedure), and $C_7$ and $S_7$ (content and set by the
simplified ratio procedure) were all computed from the data and the
intercorrelations obtained. The results are presented in Table 1. The values
below the diagonal represent the median value of the correlations in the 
respective block. 


Discussion and Conclusion 


It will be recalled that two fundamentally different models have been 
used in the derivation of the various indices. One model assumes that there 
are right and wrong answers to the items and that the degree of set can be 
determined from the answers which disagree with a key. The total score, 
the warranted set, the simplified ratio and the postulated knowledge pro- 
cedures are definitely of this type. The other model, of which the prototype 
is the correlation procedure, assumes instead that the items can be scaled 
along a continuum. The postulated scale score, while based on the latter 
model, requires the use of a key to obtain a practical solution, and thus 
might be considered a compromise. 

In view of both a comparison of the formulas and the results of the 
empirical illustration presented, it is tempting to conclude that in many 
instances it will make little difference which method is used. This is particu- 
larly true with respect to measures of set where the intercorrelations of 
the methods vary from .94 to .99. However, McCornack [5] has pointed out 
the danger of assuming that just because two keys correlate highly they will 
have the same external validity. He shows, for example, that given two keys 
which correlate .94 with one another, when one correlates .60 with an external

TABLE 1

Intercorrelation Among Various Set and Content Components of a Test Score*

N = 62

Content scores: Total Score ($C_1$), Warranted Set ($C_2$), Postulated Scale ($C_5$), Correlation ($C_6$), Simplified Ratio ($C_7$)
Set scores: Simple Difference ($S_1$), Warranted Set ($S_2$), Postulated Knowledge ($S_3$), Postulated Scale ($S_5$), Simplified Ratio ($S_7$)

Procedure   $C_1$  $C_2$  $C_5$  $C_6$  $C_7$  $S_1$  $S_2$  $S_3$  $S_5$  $S_7$
Cy 88 ooo 057 30 032 39 
Cy 92 03 38 07 07 .15. 
Ce. 87 -.07 }=-.01 -.33 | -.34 -.21 
Ce 78 ol? eae -.03 | -.01 205 
Co -05 me -.18 | -.17 -.10 
Sy -97 -94 | .95 .99 
. eo 94] .94 | .95 








s, 05 “3 98 | .97 
S. -96 he 97 
Sq 


*For 60 degrees of freedom, the 5% and 1% values of r are .250 and .325,
respectively.


criterion the other might correlate anywhere between .29 and .84 with that 
same criterion. It is extremely important, therefore, to make use, if it is 
at all feasible, of an external criterion in selecting a method for obtaining 
separate set and content components of a test score. 
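
The range quoted from McCornack can be reproduced from a standard bound on a third correlation given the other two, which follows from requiring the 3 × 3 correlation matrix to be positive semidefinite. Whether McCornack obtained the figures this way is not stated here, so the sketch is offered only as a check on the arithmetic; the function name is illustrative.

```python
import math


def criterion_correlation_bounds(r_12, r_1c):
    """Limits on r_2c given r_12 (key 1 with key 2) and r_1c (key 1 with criterion)."""
    centre = r_12 * r_1c
    half_width = math.sqrt((1 - r_12 ** 2) * (1 - r_1c ** 2))
    return centre - half_width, centre + half_width


low, high = criterion_correlation_bounds(0.94, 0.60)
print(round(low, 2), round(high, 2))  # prints "0.29 0.84"
```

Even with set scores intercorrelating .94 or higher, then, the external validities of two methods could differ widely.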

When no external criterion is available, other considerations become 
important. Thus, for example, it might be noted that in the present illustration 
the simplified ratio procedure has the highest first centroid loading in the 
matrix of set scores, has the next to the highest loading in the matrix of 
content scores, and has an average correlation between content and set 
scores which is closer to zero than that for any other method. That this 
occurred in a case which would clearly be more appropriate for the continuity 
model is particularly encouraging since use of the continuity model requires
considerable extra work in the scaling of the items and in the computation
of the scores. 

Obviously, such evidence as that presented here does not conclusively 
establish the efficacy of the content and set scores described in this paper. 
To do this it would be necessary to carry out studies which indicated whether 
or not the set can be independently manipulated through the use of various 
experimental controls. For example, one might design an experiment which 
would involve the administration of a test such as that of alternative expres- 
sions under two or more conditions that differ in the extent to which set 
is likely to occur. Or, one might try writing tests which would yield the 
same content scores but different set scores when different item forms 
were used. Beyond this, the usefulness of such scores would have to be 
established by the usual reliability and validity studies in the context of 
a particular applied situation. 

This evidence does suggest, however, that one or more of the procedures 
developed here might be useful in a number of situations. In the absence 
of an external criterion against which to compare the methods and without 
further experimental evidence, present indications are that the simplified 
ratio procedure will provide an adequate approximation to set and content 
components of a test score except when extreme accuracy is needed and,
in addition, the continuity model is obviously the more appropriate.
In this latter case, use of the correlation procedure to obtain content scores
would appear to be justified.


ITEM SELECTION METHODS FOR INCREASING TEST
HOMOGENEITY


HAROLD WEBSTER


VASSAR COLLEGE 


A number of methods for increasing test homogeneity by item selection 
are discussed. Exact selection conditions which will maximize obtained
homogeneity as measured by KR-20 and KR-21 are derived, and an application
is given. Since they require only item count data, the selection conditions are
economical to apply. 


A problem which is likely to arise whenever psychological tests are 
constructed is that of adding or discarding items in such a way that the 
resulting test will have some optimum degree of homogeneity or “split-test”
reliability. Adkins [1] and Davis [4] have reviewed various practical solutions 
in general use. A popular method consists in retaining items with high item- 
test correlations and discarding those with low correlations, but this is not 
the best way to increase obtained homogeneity as measured by the reliability 
coefficients in common use, KR-20 and KR-21, originally derived by Kuder 
and Richardson [10]. The purpose of the present paper is to present exact 
and economical item selection methods for increasing Kuder-Richardson 
reliability. 
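
For readers who want the two coefficients in computable form, the sketch below uses the familiar textbook statements of KR-20 and KR-21 for items scored 0 or 1. The use of population variances (dividing by the number of subjects) and the five-subject data set are assumptions of this sketch.

```python
def pvariance(xs):
    """Population variance (divides by the number of observations)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def kr20(item_scores):
    """KR-20 for a subjects-by-items matrix of 0/1 scores."""
    n_items = len(item_scores[0])
    n_subjects = len(item_scores)
    v_t = pvariance([sum(row) for row in item_scores])
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in item_scores) / n_subjects
        sum_pq += p * (1 - p)  # variance of a 0/1 item
    return n_items / (n_items - 1) * (1 - sum_pq / v_t)


def kr21(item_scores):
    """KR-21: uses only the number of items, the test mean, and the test variance."""
    n_items = len(item_scores[0])
    totals = [sum(row) for row in item_scores]
    mean_t = sum(totals) / len(totals)
    v_t = pvariance(totals)
    return n_items / (n_items - 1) * (1 - mean_t * (n_items - mean_t) / (n_items * v_t))


# Hypothetical three-item test taken by five subjects.
scores = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0], [0, 0, 0]]
r20 = kr20(scores)
r21 = kr21(scores)
```

For these data KR-20 is 27/34 (about .79) and KR-21 is 12/17 (about .71); KR-21 is the smaller because the three items differ in difficulty.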

There has been some debate concerning the adequacy of KR-type 
coefficients as measures of test homogeneity [11, 9]. In this paper homogeneity 
will be defined either as KR-20 or else as a rather general coefficient due to 
Lord [13], which is formally the same as KR-21; which coefficient is better 
to use in the item selection conditions probably depends, to judge from 
applications, on the sampling theory presented by Lord [14]. No attempt 
will be made to improve these definitions of homogeneity even though 
inadequacies are recognized, some of which are mentioned in the next para- 
graph. 

Although the item selection conditions which will be derived could be 
used to maximize homogeneity, there are important reasons why maximizing 
homogeneity for a given sample will usually be impractical. First, for a given 
sample small increases in homogeneity, especially in its upper range, are 
likely to be lost in subsequent samples because of unknown sampling varia- 
tions. Second, although cases where homogeneity is impractically high are 
seldom encountered, it is known (when items are assigned point scores) 
that tests with homogeneity approaching unity will have undesirable item 
redundancy [7, 16]. Finally, when using items for which the direction of
scoring is fixed, increasing homogeneity beyond a certain value tends rapidly
to increase the proportion of items retained which have extreme means, 
which in turn increases skewness of the test distribution. For these reasons, 
homogeneity somewhat less than the maximum seems practical. 

There are unsolved statistical problems, including the sampling behavior 
of KR-20 and KR-21, which will not be considered here; these problems 
are especially serious if it is necessary to increase homogeneity for a number 
of subtests simultaneously. Investigators who are primarily interested in 
obtaining group-factor subtests from numerous items have available the 
methods of Wherry and Winer [20] and of Loevinger, Gleser and DuBois [12]. 
The present paper considers the more limited problem of increasing by item 
selection the observed homogeneity of a single test. 

Gulliksen ([8], p. 379) has suggested a graphical method for selecting 
for retention in a test those items which increase homogeneity as measured 
by KR-20. Discarded items are those for which the ratio of item variance to 
reliability index (item-test covariance divided by test standard deviation) 
is relatively large. Iterative conditions are derived in the present paper which 
use KR-20 or KR-21, which allow re-examination later of previously rejected 
items to see if they should be put back in the test, and which do not require 
plotting. 

In some problems the level of precision could be increased slightly if 
interitem relationships were utilized directly in assessing increases in homo- 
geneity, but usually such a refinement does not justify the additional 
computational labor required. All methods discussed in the present paper 
require, in addition to item means and variances, only item-test relationships 
and consequently they may be applied using only item count data. 


Item Selection Conditions Which Are Independent of Test Length 


In this and the next section several item selection methods for increasing 
test homogeneity, including the popular one mentioned in the first paragraph, 
will be considered. Methods which are independent of test length are dis- 
cussed first, since it can be shown that they do not always result in increased 
homogeneity. In the next section some item selection conditions are derived 
which are dependent on test length and which necessarily increase KR-20 
and KR-21. 

The usual definition of reliability is


(1)    $r_{TT} = (V_T - E_T)/V_T = 1 - (E_T/V_T),$


where $V_T$ is the total variance and $E_T$ is the error variance of test $T$. Either
KR-20 or KR-21, which are used to define homogeneity, may be obtained
directly from (1), depending upon how $E_T$ is defined. In addition to selecting
(or discarding) items in such a way that the entire ratio (1) is increased,
there are other ways in which $r_{TT}$ might be increased: by selecting those
items which (i) increase the true variance $V_T - E_T$, or (ii) decrease $E_T$, or
(iii) increase $V_T$. Conditions derived to achieve any one of these aims alone
appear to have serious limitations when used in iterations for the purpose
of increasing $r_{TT}$. For example, it was found in several applications that no
single item existed which if discarded or added would decrease $E_T$. Also,
methods (i) and (iii) are inefficient; because of the relative stability of $E_T$,
method (i) appears to have much the same disadvantages as (iii), the
inefficiency of which will next be shown.
The variance of test $T$ minus item $j$ may be written


(2)    $V_{T-j} = V_T + V_j - 2C_{jT},$


where $C_{jT}$ is the covariance of item $j$ and test $T$. If item $j$ satisfies the
condition $V_T < V_{T-j}$, then it could be discarded to increase the test variance,
and this is seen, using (2), to be the same as discarding $j$ if


(3)    $C_{jT} < V_j/2.$


Dividing both sides of (3) by the product of standard deviations, $S_jS_T$,
$j$ is discarded if


(4)    $r_{jT} < S_j/2S_T.$


By a similar development adding an item $k$ not already in $T$ will increase
the variance if


(5)    $r_{kT} > -S_k/2S_T.$
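
Equation (2) and the discard rule (3) can be checked on a small score matrix. In this sketch population variances and covariances are used, and the five-subject, four-item data set is hypothetical, with the last item constructed to run against the rest of the test.

```python
def pvar(xs):
    """Population variance."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def pcov(xs, ys):
    """Population covariance."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def variance_without_item(scores, j):
    """Equation (2): V_T + V_j - 2 C_jT."""
    totals = [sum(row) for row in scores]
    item = [row[j] for row in scores]
    return pvar(totals) + pvar(item) - 2 * pcov(item, totals)


def items_to_discard_for_variance(scores):
    """Indices of items satisfying condition (3), C_jT < V_j / 2."""
    totals = [sum(row) for row in scores]
    flagged = []
    for j in range(len(scores[0])):
        item = [row[j] for row in scores]
        if pcov(item, totals) < pvar(item) / 2:
            flagged.append(j)
    return flagged


scores = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]
flagged = items_to_discard_for_variance(scores)
```

Only the last item is flagged; removing it raises the test variance from .64 to 1.36, in agreement with equation (2).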


Now item selection based on alternate applications of (4) and (5) will always 
increase test variance. The quantities on the right in (4) and (5) are, however, 
quite small, and as the standard deviation $S_T$ increases, these expressions
approach even more closely the condition that items be discarded or added 
merely if their test correlations are, respectively, negative or positive. But 
the latter condition is known to be an inefficient method for increasing 
homogeneity, for it can be seen that one effect of applying (4) and (5), if 
enough items are available, will be to form a very long test. It has also been 
shown by other methods that if a test is long enough, practically all items 
with item-test correlations exceeding zero will, if added to the test, contribute 
to its homogeneity [3, 17]. Items with low correlations may contribute very 
little, however, and efficiency in testing requires that the shortest tests 
which achieve a specified homogeneity be used. Bedell [2] derived equations 
which could be solved for the number of items with lowest item-test correla- 
tions to be discarded in order to maximize the reliability of a single-factor test. 
Unit rank for the item matrix was assumed, and some computational
approximations were developed.

A popular method is to discard item $j$ from test $T$ if

(6)    $r_{jT} < k,$

where $k$ is a positive constant. The requirement that retained items be
significantly correlated with their own test is approximately satisfied if a number
of items satisfying (6) are discarded at once when $k$ is some multiple of the
standard error of $r_{jT}$. But if moderately large samples of subjects and items
are available, and $k$ is chosen to correspond to one of the usual levels of
significance, then the obtained homogeneity is usually found to be decreased
after applying (6).

Another method which is independent of test length will next be discussed 
briefly. If item j satisfies the inequality 
(7) $r_{TT} < r_{(T-j)(T-j)}$, 
where the r's are homogeneity coefficients for test T and for test T minus 
item j, respectively, then discarding j will increase homogeneity. A similar 
expression can be written for the case where adding an item to T increases 
homogeneity. But whether or not j satisfies (7), or the corresponding addition 
condition, could depend, especially for short tests, on the length of T. It 
might therefore be argued that if items were to remain in T on their own 
merits, (7) should be rewritten so that it is independent of the length of T. 
It can be shown, however, that this is not advantageous, for it leads to 
expressions the application of which will eventually decrease homogeneity. 
To show this, first multiply the left side of (7) by $(n - 1)/(n - r_{TT})$. This 
is the change required to make (7) independent of test length, a fact which 
can be proved by next rearranging the terms to correspond to functions 
which are known to be invariant with respect to test length ([8], p. 85). 
Finally members of this rearranged inequality can be shown (by adding 1 
to both sides and taking reciprocals) to have exactly the same effect in item 
selection as if they were estimates of the squared average item-test correlation. 
Therefore if (7) is directly altered to make it independent of test length, its 
application will always increase the average item-test correlation, but will 
not always increase homogeneity. In fact if it is used in iterations, it will 
reject successive halves (approximately) of the items in the test, and the 
analogous addition condition will not admit back into the test later any 
items previously rejected; consequently $r_{TT}$ decreases sharply after the 
test is shortened beyond a certain point. 
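
The factor $(n - 1)/(n - r_{TT})$ can be motivated by a short calculation. The sketch below assumes that the homogeneity coefficient obeys the Spearman-Brown relation when test length changes; that assumption is introduced here for illustration, and $r^{*}$ and $m$ are names chosen for this sketch only. Stepping $r_{TT}$ down from n to n - 1 items gives the adjusted left side of (7), and the adjusted inequality can be rearranged so that each side is the same length-invariant function $r/[m(1 - r)]$ of a test of m items.

```latex
% Spearman-Brown step-down from n to n-1 items (length ratio m = (n-1)/n)
\[
r^{*} = \frac{m\,r_{TT}}{1 + (m-1)\,r_{TT}}
      = \frac{(n-1)\,r_{TT}}{n - r_{TT}},
\qquad m = \frac{n-1}{n}.
\]
% Length-adjusted form of (7) and an equivalent form (valid for coefficients below 1)
\[
\frac{(n-1)\,r_{TT}}{n - r_{TT}} < r_{(T-j)(T-j)}
\iff
\frac{r_{TT}}{n\,(1 - r_{TT})}
  < \frac{r_{(T-j)(T-j)}}{(n-1)\,\bigl(1 - r_{(T-j)(T-j)}\bigr)}.
\]
```

The equivalence follows because $x/(1 - x)$ is increasing for $x < 1$, and under Spearman-Brown the quantity $r/[m(1 - r)]$ equals $r_1/(1 - r_1)$ for every test length m, where $r_1$ is the coefficient for a test of unit length.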


Item Selection Conditions Which Are Dependent on Test Length 


We return therefore to (7) as the condition for discarding item j in 
order to increase test homogeneity. As in (1), the homogeneity of the shortened 
test is 

(8) $r_{(T-j)(T-j)} = 1 - (E_{T-j}/V_{T-j})$. 

Lord [13, 14] has shown that if $E_T$ is defined as the mean of the estimated 
sampling variances (based on random samples of n items) of the N subjects' 
test scores $T_i$, then 

(9) $E_T = (n\bar{T} - \bar{T}^2 - V_T)/(n - 1)$, 

where $\bar{T}$ is the sample mean. Substitution of (9) in (1) provides a measure of 
reliability which is formally identical with KR-21 but which is actually 
more general than either the latter or KR-20. The error variance of the shortened 
test can be written, as in (9), 

(10) $E_{T-j} = [(n - 1)(\bar{T} - p_j) - (\bar{T} - p_j)^2 - V_{T-j}]/(n - 2)$ 
     $= [n\bar{T} - \bar{T}^2 - V_T - \bar{T} - p_j(n - 2\bar{T}) + 2C_{jT}]/(n - 2)$, 

where $p_j$ is the mean of item j. 
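
The relations among (1), (9), and (10) can be checked numerically. The following is a minimal sketch, not part of the original derivation. It assumes items scored 0 or 1, population variances (divisor N), and a per-subject sampling variance estimate of the form $T_i(n - T_i)/(n - 1)$. The data are simulated and all function names are hypothetical.

```python
import numpy as np

def lord_error_variance(X):
    """Mean over subjects of T_i (n - T_i) / (n - 1) (assumed per-subject estimate)."""
    n = X.shape[1]
    T = X.sum(axis=1)
    return np.mean(T * (n - T)) / (n - 1)

def error_variance_eq9(X):
    """Right side of (9), using the mean and population variance of T."""
    n = X.shape[1]
    T = X.sum(axis=1)
    return (n * T.mean() - T.mean() ** 2 - T.var()) / (n - 1)

def kr21(X):
    """KR-21 in its usual textbook form."""
    n = X.shape[1]
    T = X.sum(axis=1)
    m, v = T.mean(), T.var()
    return n / (n - 1) * (1 - m * (n - m) / (n * v))

def shortened_error_variance_eq10(X, j):
    """Second line of (10): full-test statistics plus p_j and C_jT only."""
    n = X.shape[1]
    T = X.sum(axis=1)
    m, v = T.mean(), T.var()
    p = X[:, j].mean()
    c = np.mean(X[:, j] * T) - p * m          # population covariance C_jT
    return (n * m - m ** 2 - v - m - p * (n - 2 * m) + 2 * c) / (n - 2)

# Hypothetical data: 500 subjects, 20 binary items driven by one latent trait.
rng = np.random.default_rng(0)
theta = rng.normal(size=(500, 1))
X = (0.8 * theta + rng.normal(size=(500, 20)) > 0.5).astype(float)

E_T = error_variance_eq9(X)
V_T = X.sum(axis=1).var()
print(np.isclose(lord_error_variance(X), E_T))    # two forms of E_T
print(np.isclose(1 - E_T / V_T, kr21(X)))         # (9) substituted in (1) vs KR-21
print(np.isclose(shortened_error_variance_eq10(X, 3),
                 error_variance_eq9(np.delete(X, 3, axis=1))))
```

The second line of (10) depends on the identity $V_j = p_j(1 - p_j)$, so the last comparison holds only for 0/1 items with population variances.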

Substituting (1), (8), (2), (9), and (10) in (7) and simplifying, it is found 
that item j may be discarded to increase homogeneity (as measured either 
by Lord's formula or by KR-21) if 
(11) $C_{jT} - k_1 V_j - k_2 p_j < k_3$, 
where the constants in (11) are 

$k_1 = (n - 2)(n\bar{T} - \bar{T}^2 - V_T)/2[(n - 2)(n\bar{T} - \bar{T}^2) + V_T]$, 
$k_2 = (n - 1)(n - 2\bar{T})V_T/2[(n - 2)(n\bar{T} - \bar{T}^2) + V_T]$, 
$k_3 = V_T(\bar{T}^2 - \bar{T} + V_T)/2[(n - 2)(n\bar{T} - \bar{T}^2) + V_T]$. 
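
As an illustration, condition (11) can be compared with a direct recomputation of KR-21 after each item is removed. The sketch below is hypothetical: the data are simulated, the function names are invented for this example, and population variances (divisor N) are assumed so that $V_j = p_j(1 - p_j)$ for 0/1 items.

```python
import numpy as np

def kr21(X):
    """KR-21 from the mean and population variance of total scores."""
    n = X.shape[1]
    T = X.sum(axis=1)
    m, v = T.mean(), T.var()
    return n / (n - 1) * (1 - m * (n - m) / (n * v))

def discard_by_condition_11(X):
    """Boolean array, True where item j satisfies (11)."""
    n = X.shape[1]
    T = X.sum(axis=1)
    m, v = T.mean(), T.var()
    p = X.mean(axis=0)                          # item means p_j
    V = X.var(axis=0)                           # item variances V_j
    C = (X * T[:, None]).mean(axis=0) - p * m   # item-test covariances C_jT
    denom = 2 * ((n - 2) * (n * m - m ** 2) + v)
    k1 = (n - 2) * (n * m - m ** 2 - v) / denom
    k2 = (n - 1) * (n - 2 * m) * v / denom
    k3 = v * (m ** 2 - m + v) / denom
    return C - k1 * V - k2 * p < k3

def discard_by_direct_comparison(X):
    """True where KR-21 of the test without item j exceeds KR-21 of the full test."""
    base = kr21(X)
    return np.array([kr21(np.delete(X, j, axis=1)) > base
                     for j in range(X.shape[1])])

# Hypothetical data: 400 subjects, 12 binary items, some nearly unrelated to the trait.
rng = np.random.default_rng(1)
theta = rng.normal(size=(400, 1))
loadings = np.linspace(0.0, 1.2, 12)
X = (loadings * theta + rng.normal(size=(400, 12)) > 0.7).astype(float)

by_formula = discard_by_condition_11(X)
by_direct = discard_by_direct_comparison(X)
print(by_formula)
print(by_direct)
```

The practical appeal of (11) is that all n items are screened from one set of full-test statistics, whereas the direct comparison rescoring the test n times.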


Suppose that item k is not in test T. By a derivation analogous to the 
above, item k may be added to T to increase homogeneity if 

(12) $C_{kT} + K_1 V_k - K_2 p_k > K_3$, 

where the constants are 

$K_1 = n(n\bar{T} - \bar{T}^2 - V_T)/2(n^2\bar{T} - n\bar{T}^2 - V_T)$, 
$K_2 = (n - 1)(n - 2\bar{T})V_T/2(n^2\bar{T} - n\bar{T}^2 - V_T)$, 
$K_3 = V_T(\bar{T}^2 - \bar{T} + V_T)/2(n^2\bar{T} - n\bar{T}^2 - V_T)$. 
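
The analogous derivation can be sketched as follows. The steps assume that item k is scored 0 or 1, so that $V_k = p_k(1 - p_k)$, and that $V_{T+k} = V_T + V_k + 2C_{kT}$. The abbreviation $A$ is introduced here only to shorten the algebra.

```latex
% A = n\bar{T} - \bar{T}^2 - V_T, so that E_T = A/(n-1) by (9)
\begin{align*}
E_{T+k} &= \frac{(n+1)(\bar{T} + p_k) - (\bar{T} + p_k)^2 - V_{T+k}}{n}
         = \frac{A + \bar{T} + p_k(n - 2\bar{T}) - 2C_{kT}}{n},\\[4pt]
\frac{E_{T+k}}{V_{T+k}} < \frac{E_T}{V_T}
 &\iff (n-1)V_T\bigl[A + \bar{T} + p_k(n - 2\bar{T}) - 2C_{kT}\bigr]
        < nA\,(V_T + V_k + 2C_{kT})\\
 &\iff 2C_{kT}\,(n^2\bar{T} - n\bar{T}^2 - V_T) + nA\,V_k
        - (n-1)(n - 2\bar{T})V_T\,p_k
        > V_T(\bar{T}^2 - \bar{T} + V_T).
\end{align*}
```

Dividing the last line by $2(n^2\bar{T} - n\bar{T}^2 - V_T)$, which is positive because it equals $nA + (n - 1)V_T$, gives a condition of the form (12).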




It is interesting that the item mean p remains explicit in (11) and (12). 
This is not so for conditions derived using KR-20. One can discard j to 
increase homogeneity as measured by KR-20 if 

(13) $\frac{n}{n-1}\left[\frac{V_T - \sum V_i}{V_T}\right] < \frac{n-1}{n-2}\left[\frac{V_{T-j} - \sum V_i + V_j}{V_{T-j}}\right]$, 

which, using (2), can be simplified to 

(14) $C_{jT} - h_1 V_j < h_2$, 

the constants being 

$h_1 = \{[(n - 1)^2 + 1]V_T + n(n - 2)\sum V_i\}/2[V_T + n(n - 2)\sum V_i]$, 
$h_2 = V_T(V_T - \sum V_i)/2[V_T + n(n - 2)\sum V_i]$. 
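
A check of the same kind can be made for the KR-20 discard condition (14). This is a sketch under stated assumptions: simulated data, population variances (divisor N), sums taken over the n items of T, and hypothetical function names.

```python
import numpy as np

def kr20(X):
    """KR-20 from the population variance of T and the sum of item variances."""
    n = X.shape[1]
    v = X.sum(axis=1).var()
    return n / (n - 1) * (v - X.var(axis=0).sum()) / v

def discard_by_condition_14(X):
    """Boolean array, True where item j satisfies (14)."""
    n = X.shape[1]
    T = X.sum(axis=1)
    v = T.var()
    V = X.var(axis=0)                            # item variances V_j
    S = V.sum()                                  # sum of V_i over the n items
    C = (X * T[:, None]).mean(axis=0) - X.mean(axis=0) * T.mean()
    denom = 2 * (v + n * (n - 2) * S)
    h1 = (((n - 1) ** 2 + 1) * v + n * (n - 2) * S) / denom
    h2 = v * (v - S) / denom
    return C - h1 * V < h2

def discard_by_direct_kr20(X):
    """True where KR-20 of the test without item j exceeds KR-20 of the full test."""
    base = kr20(X)
    return np.array([kr20(np.delete(X, j, axis=1)) > base
                     for j in range(X.shape[1])])

# Hypothetical data: 400 subjects, 12 binary items of varying quality.
rng = np.random.default_rng(2)
theta = rng.normal(size=(400, 1))
loadings = np.linspace(0.0, 1.2, 12)
X = (loadings * theta + rng.normal(size=(400, 12)) > 0.7).astype(float)

flags_14 = discard_by_condition_14(X)
flags_direct = discard_by_direct_kr20(X)
print(flags_14)
print(flags_direct)
```

Unlike (11), this condition uses only variances and covariances, so the item means do not enter.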

Similarly, adding item k to T will increase KR-20 if 

(15) $C_{kT} - H_1 V_k > H_2$, 

where 

$H_1 = n^2(V_T - \sum V_i)/2(n^2 \sum V_i - V_T)$, 
$H_2 = V_T(V_T - \sum V_i)/2(n^2 \sum V_i - V_T)$. 
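
The addition condition for KR-20 can be sketched in the same manner. The steps assume $V_{T+k} = V_T + V_k + 2C_{kT}$. The abbreviation $S$ is introduced here for the sum of the n item variances of T.

```latex
% S = \sum V_i over the n items of T; the lengthened test has n + 1 items
\begin{align*}
\frac{n+1}{n}\cdot\frac{V_T - S + 2C_{kT}}{V_T + V_k + 2C_{kT}}
  &> \frac{n}{n-1}\cdot\frac{V_T - S}{V_T}\\[4pt]
\iff (n^2 - 1)\,V_T\,(V_T - S + 2C_{kT})
  &> n^2\,(V_T - S)(V_T + V_k + 2C_{kT})\\[4pt]
\iff 2C_{kT}\,(n^2 S - V_T) - n^2 (V_T - S)\,V_k
  &> V_T\,(V_T - S).
\end{align*}
```

Since $V_T \le nS$, the factor $n^2 S - V_T$ is positive whenever $S > 0$, and dividing by twice that factor leaves the direction of the inequality unchanged. On this sketch the coefficient of $V_k$ carries $n^2$ rather than $n$.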


Applications of (11), (12), (14), and (15) 


Computations of the item selection conditions to increase either KR-21 
or KR-20 are not as laborious as they may first appear. By grouping the 
test distribution into five symmetrical categories as recommended by Flanagan 
[5], and obtaining item counts for the four extreme categories, calculation 
of the test variances and item-test covariances required in the conditions 
can be carried out rapidly. 

It is known that if the T distribution is grouped into five categories 
containing percentages of scores (from high to low), 9, 19, 44, 19 and 9, 
which are assigned new scores, 2, 1, 0, -1 and -2, respectively, then the 
grouping will not only have high efficiency, but will also incorporate an 
adjustment for the effect of the coarse grouping on the estimate of $r_{jT}$, 
the item-test point-biserial correlation [18]. The new covariances will be, for 
a sample of N subjects, 

(16) $c_{jT} = (2e + f - g - 2h)/N = D_{jT}/N$. 

In (16) e, f, g, and h are frequencies of a preferred response to item j for 
the extreme categories for which the scores are 2, 1, -1, and -2, respectively. 
It can then be shown that the covariances needed in (11), (12), (14), and (15) 
are, to a sufficiently close approximation, 

(17) $C_{jT} = D_{jT} \sum D_{iT}/N^2$, 

where the summation is over the n weighted differences, corresponding to 
the n items, as defined in (16). Also an estimate of the variance required for 
computing the constants of the item selection conditions is 

(18) $V_T = (\sum D_{iT})^2/N^2$. 
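
The item-count computations in (16), (17), and (18) can be sketched as follows. The sketch is hypothetical: the data are simulated, subjects are assigned to the five categories by rank with ties broken arbitrarily (an assumption, since the text does not say how ties are handled), and all names are invented for this example. The grouped score is called `w` to avoid confusion with the frequency g.

```python
import numpy as np

def grouped_scores(T, fractions=(0.09, 0.19, 0.44, 0.19, 0.09),
                   values=(2, 1, 0, -1, -2)):
    """Assign 2, 1, 0, -1, -2 by rank from high to low; ties broken arbitrarily."""
    N = len(T)
    order = np.argsort(-T, kind="stable")       # subject indices, highest score first
    cuts = np.round(np.cumsum(fractions) * N).astype(int)
    cuts[-1] = N                                # guard against floating-point drift
    w = np.empty(N, dtype=int)
    start = 0
    for stop, val in zip(cuts, values):
        w[order[start:stop]] = val
        start = stop
    return w

def weighted_differences(X, w):
    """D_jT = 2e + f - g - 2h for every item, the numerator of (16)."""
    return X.T @ w

def approx_cov_and_var(X, w):
    """Approximate C_jT by (17) and V_T by (18)."""
    N = X.shape[0]
    D = weighted_differences(X, w)
    C = D * D.sum() / N ** 2
    V_T = D.sum() ** 2 / N ** 2
    return D, C, V_T

# Hypothetical data: 100 subjects, 15 binary items.
rng = np.random.default_rng(3)
theta = rng.normal(size=(100, 1))
X = (0.9 * theta + rng.normal(size=(100, 15)) > 0.4).astype(float)

w = grouped_scores(X.sum(axis=1))
D, C, V_T = approx_cov_and_var(X, w)
print(np.bincount(w + 2))          # category sizes for scores -2, -1, 0, 1, 2
print(D[:5], V_T)
```

By construction the approximate covariances in (17) sum to the approximate variance in (18), which mirrors the exact relation between item-test covariances and test variance.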


As an example, Table 1 presents data which show how the item selection 
conditions increased homogeneity for a sample of 100 college women. The 
test used was the De scale [6] from the California Psychological Inventory, 
a personality test known to discriminate delinquent from nondelinquent 
persons in numerous samples. In a number of previous samples, KR-21 for 
the complete scale of 54 items has been found to fall in the range .50 to .60. 


TABLE 1 

Variations in Homogeneity Due to Selecting Items 
so as to Increase KR-21 and KR-20 

 n    KR-21   KR-20         Mean    V_T           Discard   Add 
54    .562    .675          17.85   26.63         15        0 
39    .677    [illegible]   11.47   23.81          7        1 
33    .708    [illegible]    8.33   19.89          8        1 
26    .725    .764           6.15   15.52          -        - 
39    .677    [illegible]   11.47   23.81          7        0 
32    .685    .770           9.38   [illegible]    5        4 
31    .693    .780           8.46   18.66          -        - 


The first four rows of Table 1 show how KR-20 and KR-21 varied 
when items were alternately discarded and added back in iterations using 
conditions (11) and (12). The last three rows of Table 1 show variations in 
these same coefficients when (14) and (15) were applied starting with the 
39-item test of the second row. 
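
The KR-21 entries for the first four rows of Table 1 can be recomputed from n, the mean, and the variance alone. The sketch below uses the usual KR-21 formula. The function name is hypothetical, and the figures are those legible in the table.

```python
def kr21_from_summary(n, mean, variance):
    """KR-21 from test length, mean score, and score variance."""
    return n / (n - 1) * (1 - mean * (n - mean) / (n * variance))

# (n, mean, V_T, tabled KR-21) for the first four rows of Table 1
rows = [
    (54, 17.85, 26.63, 0.562),
    (39, 11.47, 23.81, 0.677),
    (33, 8.33, 19.89, 0.708),
    (26, 6.15, 15.52, 0.725),
]

recomputed = [round(kr21_from_summary(n, m, v), 3) for n, m, v, _ in rows]
ratios = [v / m for _, m, v, _ in rows]   # variance-to-mean ratio per row

for (n, m, v, tabled), r, q in zip(rows, recomputed, ratios):
    print(n, tabled, r, round(q, 2))
```

The same rows also show the variance-to-mean ratio rising as items are removed, from about 1.49 for the 54-item test to about 2.52 for the 26-item test.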


Discussion 


The method is not very time consuming, and with the help of (17) 
and (18) can easily be applied using item count data. Since there were only 
100 papers, the data of Table 1 required the time of one person for two 
days; however, this sample was used only for this example, and a much 
larger sample would ordinarily be preferable. It is likely that when there 
is only a single sample available, the greater the number of iterations employed, 
the larger N should be in order to allow for the increased use of variations 
peculiar to the one sample. 

From Table 1 it can be seen that the ratio of variance to mean increases, 
for either method, with the number of iterations. This is an indication of 
the increasing skewness already mentioned. If the scoring for every item 
were reversed, so that the test became one of nondelinquency, an examination 
of the selection conditions shows that the skewness would still increase, but 
in the opposite direction. Also in (11) the greater the skewness, the more 
$n - 2\bar{T}$, in the constant $k_2$, differs from zero, thus assigning increasing weight 
to the item means. 

In Table 1, KR-20 necessarily exceeds KR-21 even when it is KR-21 
that is increased; however, when conditions for increasing KR-20 are applied 
(last three rows of Table 1), KR-21 increases rather slowly, so that the 
differences between these two measures in the last two rows are larger than the 
difference usually found at this reliability level. Lord [14] shows that 
unreliability due to variations of obtained means about the true mean is included 
in the estimate provided by KR-21, but not in that provided by KR-20. In the iterative 
process of which Table 1 is an example, the true mean also must vary because 
the test length changes. This would seem to imply that neither reliability 
measure could be ideally coordinated with the underlying stochastic process. 
Because of the differences arising between the two coefficients in Table 1, 
however, it is likely that use of the conditions based on KR-21 will produce 
reliability which holds up better in subsequent samples. It is recommended, 
therefore, that (11) and (12), rather than the KR-20 conditions, be used, 
preferably with a new sample of subjects every time a new form of the test 
is scored. Even if a succession of samples is not available, one or two iterations 
using (11) and (12) with a large sample would seem preferable to other 
methods which have been considered for increasing homogeneity. 